JP3564999B2 - Information retrieval device

Info

Publication number: JP3564999B2
Application number: JP06658598A
Authority: JP
Inventors: 正雄伊藤; 隆正小山
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1998-03-17
Filing date: 1998-03-17
Publication date: 2004-09-15
Anticipated expiration: 2018-03-17
Also published as: HK1022538A1; CN1229218A; CN1114880C; JPH11265393A

Description

【０００１】
【発明の属する技術分野】
本発明は電子化された文書データから情報検索を行なう場合において、複数の検索エンジンで構成された場合でも、高速に書誌一覧の取得が可能な情報検索装置に関するものである。
【０００２】
【従来の技術】
近年、ワープロやパソコンの普及により、大量の文書情報が蓄積され、必要に応じて文書情報を検索する文書データベースに対する関心が高まってきている。また、文書情報に対して、キーワードを付けずに文書の内容から検索する全文検索方式が注目され、インターネットのホームページ検索等で利用されている。この全文検索方式を用いた検索システムは、サーバー・クライアント、またはＷＷＷサーバと接続した形態でユーザが使用できる。このような検索システでは、１つのユーザに検索システムを専有させるのではなく、検索結果の一覧を表示する場合には数十件単位で表示することで、ユーザ要求を同時に処理する方法が取られている。更に、検索結果一覧は単に登録順に出力するのではなく、文書と検索条件の間にある基準を設けて数値化（スコア）し、得られたスコアに従って書誌一覧を順位付けしている。このようにすることで、ユーザの要望に近い検索結果を出力することができる。ここで、「数値化」とは、一律に数値化しているだけでなく、検索対象となる文章（例えば短い文章は長い文章よりも重みづける等）、単語によって重みづけをつけて数値化することも含む意味である。また、「書誌一覧」とは、文書番号だけではユーザにわかりにくいので、例えばホームページのタイトルやＵＲＬ（ＵｎｉｖｅｒｓａｌＲｅｓｏｕｒｃｅＬｏｃａｔｏｒ）などを意味する。
【０００３】
以下、従来の情報検索装置について説明する。
図８は従来の情報検索装置の構成を示すものである。図８において、８１−１，８１−２、・・・・、８１−ｎはクライアント部で、８２は通信部で、８３は検索エンジン部で、８４は索引格納部で、８５は書誌格納部である。
【０００４】
以上のように構成された情報検索装置について、以下その動作について説明する。まず、各クライアント部８１−１，８１−２、・・・・、８１−ｎでユーザからの検索要求を通信部８２に送る。通信部８２は複数のクライアント部からの検索要求を内部に格納し、検索エンジン部８３に送る。検索エンジン部８３は索引格納部８４から索引情報を読み出して高速に検索を行ない通信部８２に返す。通信部８２では検索結果件数をクライアント部８１−１，８１−２、・・・・、８１−ｎに返す。またクライアント部８１−１，８１−２、・・・・、８１−ｎから検索にヒットした書誌一覧の要求を通信部８２に送る。通信部８２は書誌一覧要求を検索エンジン部８３に送る。検索エンジン部８３は、書誌格納部８５から書誌一覧を読み出して書誌一覧の作成を行ない、通信部８２に返す。通信部８２では書誌一覧をクライアント部８１−１，８１−２、・・・・、８１−ｎに返す。
【０００５】
【発明が解決しようとする課題】
しかしながら上記の従来の構成では、格納する文書件数が数千万件を増えるような場合には、１つの計算機では検索性能の低下や、ハードディスクやメモリなどの物理的な計算機資源の制約によって限界があり、複数の計算機で対応する必要があった。しかしながら、複数の計算機で対応するには複数の検索エンジン部で対応することになり、順序付けされた書誌一覧を取得する場合において、検索エンジン間の通信負荷が大きくなるために、全体の性能が低下するという課題を有していた。
【０００６】
本発明は上記従来技術の課題を解決するもので、複数検索エンジンの構成で順序付けされた書誌一覧を取得する場合でも通信負荷を最小限にすることを目的とする。
【０００７】
【課題を解決するための手段】
この目的を達成するために本発明における情報検索装置は、第１に、少なくとも文書データの検索と書誌一覧の作成と検索結果をある基準値に従って順序付けをそれぞれ独立して行なう複数の検索エンジン部と、検索を行うための検索情報を格納する索引格納部と、書誌一覧を作成するための情報を格納する書誌格納部と、複数の検索エンジン部の検索結果の全体を順序付けする全体ソート部とを備え、全体ソート部で、検索結果の先頭から所定の順序付けられた基準値までを各検索エンジン部から取得することにより、検索結果の書誌一覧を取得することを特徴とする。上記構成によって、書誌一覧を高速に作成することができる。
【０００８】
第２に、全体ソート部は、検索結果全体の半分以降に位置づけられた書誌一覧を取得する場合に、検索エンジン部の検索結果の末尾から順序付けの基準値を取得することを特徴とする。これにより、複数の検索エンジン部で書誌一覧を取得する場合に、全体ソート部において複数の検索エンジン部の検索履歴のスコアを全て抽出するのではなく、検索結果の先頭または末尾からの取得番号と取得件数に応じて部分的にスコアを抽出することにより、高速に書誌一覧の作成することができる。
【０００９】
第３に、全体ソート部は、検索結果を順序付けの基準値でｎ分割し（ｎ≧２）、各々分割された基準値の下限値以上の件数を各検索エンジンから最初に取得して、各範囲内の件数を累計することで、目的の書誌一覧の位置を割り出すことを可能にした。これにより、スコアの件数分布を各検索エンジン部から取得して、件数分布から必要となる検索結果の位置を再度計算してスコアを部分的に取得することで、高速に書誌一覧の作成することができる。
【００１０】
【発明の実施の形態】
（実施の形態１）
以下、本発明の第１の実施例について、図面を参照しながら説明する。
【００１１】
図１は本発明の一実施例における情報検索装置の構成図である。図１において、１１−１、１１−２、・・・・、１１−ｎはクライアント部、１２は通信部、１３−１、１３−２、・・・・、１３−ｎは検索エンジン部、１４は全体ソート部、１５は高スコア記憶部、１６は索引格納部、１７は書誌格納部である。
【００１２】
以上のように構成された情報検索装置について、その動作を説明する。まず、各クライアント部１１−１、１１−２、・・・・、１１−ｎでユーザからの検索要求を通信部１２に転送し、通信部１２は検索要求がきた場合には各検索エンジン部１３に対して検索件数の要求を行ない、各検索エンジン部１３−１、１３−２、・・・・、１３−ｎは索引格納部１６から検索するための索引情報を読み出して検索し、検索結果件数を通信部１２に渡す。通信部１２は各検索エンジン部１３−１、１３−２、・・・・、１３−ｎの検索結果件数を合計してクライアント部に返す。また通信部１２はクライアント部１１−１、１１−２、・・・・、１１−ｎから書誌一覧の要求がきた場合には、全体ソート部１４に対して書誌一覧の先頭からの番号と取得する件数を送る。全体ソート部１４は各検索エンジン部１３−１、１３−２、・・・・、１３−ｎに対して検索結果の情報が格納される検索履歴中の検索要求と文書間である基準で求めた値（スコア）を（｛取得開始番号｝＋｛取得件数｝−１）件分だけ要求する。検索エンジン部１３−１、１３−２、・・・・、１３−ｎはスコアに従ってソートし、要求された件数分だけスコアを全体ソート部１４に返す。全体ソート部１４では得られた各検索エンジン部１３−１、１３−２、・・・・、１３−ｎのスコアをスコア順に並べ変え、各検索エンジン部１３−１、１３−２、・・・・、１３−ｎの開始番号と取得件数を求める。全体ソート部１４は求めた開始番号と取得件数を各検索エンジン部１３−１、１３−２、・・・・、１３−ｎに送り、各検索エンジン部１３−１、１３−２、・・・・、１３−ｎは開始番号から検索履歴の文書番号を読み出して、文書番号から書誌格納部１７から書誌内容を作成して全体ソート部１４に送る。全体ソート部１４は各検索エンジン部１３３−１、１３−２、・・・・、１３−ｎから得られた書誌の内容と、スコア順に並べ変えた情報から書誌を並べ変えることで書誌一覧を作成し通信部１２に返す。通信部１２はクライアント部１１−１、１１−２、・・・・、１１−ｎに書誌一覧を転送して処理が終了する。
【００１３】
図２は検索エンジン部１３−１、１３−２、・・・・、１３−ｎで格納されている検索結果の情報である検索履歴の例を示し、ここでは３台の検索エンジンの検索履歴を示す。２１は第１検索エンジン部の検索履歴で、２２は第２検索エンジン部の検索履歴で、２３は第３検索エンジン部の検索履歴である。それぞれの履歴は、スコアで降順にソートされている状態を示す。この検索履歴に対して、取得開始番号が１で取得件数が１０件の書誌一覧を取得する場合には、全体ソート部１４で、｛１＋１０−１＝１０｝件のスコアの取得要求を各検索エンジン部１３に送ることになり、各検索エンジン部１３−１、１３−２、・・・・、１３−ｎは上位１０件のスコアを取り出した例が、２４、２５、２６である。２４は第１検索エンジン部の１０件分のスコアを示し、２５は第２検索エンジン部の１０件分のスコアを表し、２６は第３検索エンジン部の１０件分のスコアを表す。以上の例に示すように、各検索エンジンで上位のスコアを求めることができる。
【００１４】
図３は図２のスコアを全体ソート部１４でソートした例を示す図である。
この図では３台の検索エンジンのそれぞれ１０件ずつの検索履歴を取得し、全体で３０件の検索履歴をスコア順に並べ変えたものである。この例では取得開始番号が１で取得件数が１０件なので、１〜１０番目の検索履歴がクライアント部１１−１、１１−２、・・・・、１１−ｎに返す書誌一覧になる。この検索履歴から各々の検索エンジン部１３−１、１３−２、・・・・、１３−ｎに対する開始番号と取得件数を求めた図が図３である。この例では第１検索エンジン部には開始番号１で取得件数２、第２検索エンジン部には開始番号１で取得件数４、第３検索エンジン部には開始番号１で取得件数４になる。以上の例に示すように、全体ソート部１４でスコアをソートして、各検索エンジン部１３−１、１３−２、・・・・、１３−ｎに要求するための開始番号と取得件数を求めることができる。
【００１５】
図４は図３の各検索エンジン部の開始番号と取得件数から書誌を取得し、書誌一覧を作成する過程を示す図である。
【００１６】
各検索エンジン部１３−１、１３−２、・・・・、１３−ｎで検索履歴から文書番号を求めて、書誌格納部１７から文書番号に該当する書誌内容を読み出し、全体ソート部１４に各検索エンジン部１３−１、１３−２、・・・・、１３−ｎで得られた書誌を転送する。全体ソート部１４では各々の書誌をスコア順に並び替えて書誌一覧を作成し、通信部１２に書誌一覧を返す。
【００１７】
以上の例に示すように、各検索エンジンの開始番号と取得件数から書誌内容を作成し、全体ソート部で再度並べ替えることで、書誌一覧を作成することができる。
【００１８】
以上のように本実施例によれば、複数検索エンジン部で構成された情報検索装置において、スコアなどで順序付けされた検索結果から目的の書誌一覧を取得する場合に、全体ソート部と高スコア記憶部を設けることにより、必要な検索履歴を部分的に取得するだけで、書誌一覧を高速に作成することができる。
【００１９】
なお、実施の形態１においてクライアント部と通信部と検索エンジン部と全体ソート部は１つの計算機で行なってもよいし、全て別々の計算機で行なってもよい。また部分的に１つの計算機で行なってもよいものとする。
【００２０】
また、実施の形態１において通信部は各検索エンジン部の検索結果件数を保持して全体ソート部に渡すことで、全体ソート部が検索結果件数が０件の検索エンジン部に対しては、書誌の取得要求を行なわないことで、０件の検索エンジン部との通信時間を低減することができる。
【００２１】
（実施の形態２）
以下、本発明の実施の形態２について、図面を参照しながら説明する。
【００２２】
図５は本発明の一実施例における情報検索装置を示す図である。
図５において、５１−１、５１−２、・・・・、５１−ｎはクライアント部、５２は通信部、５３−１、５３−２、・・・・、５３−ｎは検索エンジン部、５６は索引格納部、５７は書誌格納部で、以上は図１の構成と同様なものである。図１の構成と異なるのは全体ソート部５４とスコア記憶部５５を、検索履歴からスコア情報を取得する場合に、スコアの高い順に取得して記憶するのではなく、取得開始番号の位置によって、スコアの高い順に取得するかスコアの低い順に取得するかを自動的に選択することができるようにした点である。
【００２３】
例えば、新聞記事が日付順に並んでいない場合であって、新しい記事を取得したい場合には、先頭から取得するよりも、末尾から取得した方が効率的に検索できる場合がある。
【００２４】
上記のように構成された情報検索装置について、以下その動作を説明する。
まず、クライアント部５１−１、５１−２、・・・・、５１−ｎでユーザからの検索要求を通信部５２に転送し、通信部５２は検索要求がきた場合には各検索エンジン部５３−１、５３−２、・・・・、５３−ｎに対して検索件数の要求を行ない、各検索エンジン部５３−１、５３−２、・・・・、５３−ｎは索引格納部５６から検索するための索引情報を読み出して検索し、検索結果件数を通信部５２に渡す。通信部５２は各検索エンジン部５３−１、５３−２、・・・・、５３−ｎの検索結果件数を合計してクライアント部５１に返す。また通信部５２は書誌一覧の要求がきた場合には、全体ソート部５４に対して書誌一覧の取得開始番号と取得件数を送る。全体ソート部５４は各検索エンジン部５３−１、５３−２、・・・・、５３−ｎに対して、全体の検索結果件数を２で割った値より取得開始番号が大きい場合には、検索履歴の末尾から（｛全体の検索結果件数｝−｛取得開始番号｝−｛取得件数｝＋２）番目で（｛全体の検索結果件数｝−｛取得開始番号｝＋１）件取得することを要求する。検索エンジン部５３−１、５３−２、・・・・、５３−ｎはスコアに従ってソートし、ソートした結果の先頭または末尾から要求された件数分だけ、スコアを全体ソート部５４に返す。全体ソート部５４では得られた各検索エンジン部５３−１、５３−２、・・・・、５３−ｎのスコアを、先頭から取得した場合は、降順にスコアに並べ替え、末尾から取得した場合は、昇順にスコアを並び替えて、各検索エンジン部５３−１、５３−２、・・・・、５３−ｎの開始番号と取得件数を求める。全体ソート部５４は求めた開始番号と取得件数を各検索エンジン部５３−１、５３−２、・・・・、５３−ｎに送り、各検索エンジン部５３−１、５３−２、・・・・、５３−ｎは開始番号から検索履歴の文書番号を読み出して、文書番号から書誌格納部５７から書誌内容を作成して全体ソート部５４に送る。全体ソート部５４は各検索エンジン部５３３−１、５３−２、・・・・、５３−ｎから得られた書誌の内容と、スコア順に並べ替えた情報から書誌を並べ変えることで書誌一覧を作成し通信部５２に返す。通信部はクライアント部５１−１、５１−２、・・・・、５１−ｎに書誌一覧を転送して処理が終了する。
【００２５】
以上のように、全体ソート部が書誌一覧の取得する位置に応じて先頭または末尾から検索履歴のスコアを取得することで、全体ソート部に転送するスコアが少なくなり、更に全体ソート部５４でソートする件数も少なくなり、より高速な書誌一覧の取得を行なうことができる。
【００２６】
なお、実施の形態２において検索履歴の末尾から取得するとしたが、検索エンジン部でのソートを降順から昇順にソートすることで、先頭から取得するようにしてもよい。
【００２７】
（実施の形態３）
以下、本発明の実施の形態３について、図面を参照しながら説明する。
【００２８】
図６は本発明の一実施例における情報検索装置を示す図である。
図６において、６１−１、６１−２、・・・・、６１−ｎはクライアント部、６２は通信部、６３−１、６３−２、・・・・、６３−ｎは検索エンジン部、６７は索引格納部、６７は書誌格納部で、以上は図１の構成と同様なものである。図１の構成と異なるのは全体ソート部６４とスコア記憶部６５とスコア分布記憶部６６であり、書誌一覧を取得する場合に全体ソート部６４でスコアと件数の分布情報を検索エンジン部６３−１、６３−２、・・・・、６３−ｎから取得して、各スコア範囲内で件数を合計（累計）することで、検索エンジン部６３−１、６３−２、・・・・、６３−ｎから取得するスコア件数を減らすことができる。
【００２９】
上記のように構成された情報検索装置について、以下その動作を説明する。
まず、クライアント部６１−１、６１−２、・・・・、６１−ｎでユーザからの検索要求を通信部６２に転送し、通信部６２は検索要求がきた場合には各検索エンジン部６３−１、６３−２、・・・・、６３−ｎに対して検索件数の要求を行ない、各検索エンジン部６３−１、６３−２、・・・・、６３−ｎは索引格納部６７から検索するための索引情報を読み出して検索し、検索結果件数を通信部６２に渡す。通信部６２は各検索エンジン部６３−１、６３−２、・・・・、６３−ｎの検索結果件数を合計してクライアント部６１に返す。
【００３０】
また、通信部６２は書誌一覧の要求がきた場合には、全体ソート部６４に対して書誌一覧の先頭からの番号と取得する件数を送る。全体ソート部６４はスコアの最大値をｍとして０〜ｍまでのスコアをｎ分割した各スコア範囲内ではスコアの下限値以上の件数を検索エンジン部６３−１、６３−２、・・・・、６３−ｎから取得するように要求する。検索エンジン部６３−１、６３−２、・・・・、６３−ｎでは検索履歴のスコアから各スコア範囲内の下限値以上の件数を求め、全体ソート部６４に送る。全体ソート部か各検索エンジンから得られたスコア分布をスコア分布記憶部６６に格納し、全体のスコア分布を作成する。これにより、書誌一覧の取得開始番号がどのスコア範囲内に位置するかわかるので、再度検索エンジン部６３−１、６３−２、・・・・、６３−ｎに対してスコアがｓ以下で、かつ（｛取得開始番号｝−｛１つ上のスコア範囲の下限スコア以上の値を持つ件数｝＋｛取得件数｝−１）件のスコアと通し番号を取得して、全体ソート部６４に送る。全体ソート部６４では得られた各検索エンジン部６３−１、６３−２、・・・・、６３−ｎのスコアをスコア順に並べ変え、各検索エンジン部６３−１、６３−２、・・・・、６３−ｎの開始番号と取得件数を求める。全体ソート部６４は求めた開始番号と取得件数を各検索エンジン部６３−１、６３−２、・・・・、６３−ｎに送り、各検索エンジン部６３−１、６３−２、・・・・、６３−ｎは開始番号から検索履歴の文書番号を読み出して、文書番号から書誌格納部６８から書誌内容を作成して全体ソート部６４に送る。全体ソート部６４は各検索エンジン部６３−１、６３−２、・・・・、６３−ｎから得られた書誌の内容と、スコア順に並べ変えた情報から書誌を並べ変えることで書誌一覧を作成し通信部６２に返す。通信部６２はクライアント部６１−１、６１−２、・・・・、６１−ｎに書誌一覧を転送して処理が終了する。
【００３１】
図７は検索エンジン部でスコア分布を作成した例を示す図である。３つの検索エンジンが検索履歴のスコアから各スコア範囲の下限値以上の件数を求めたものが７１、７２、７３である。それぞれのスコア分布は全体ソート部６４に送られ、７４に示すように各スコア範囲内で３つの検索エンジン部６３のスコアが全体ソート部６４で合計される。この図の例では、取得開始番号が５０１番目で取得件数が２０件の場合は、スコアが８００以上が４７６件であるため、５０１番目はスコアが８０１以下である。このため、全体ソート部６４は、各検索エンジンに対して、スコアが８０１以下で、４６件（５０１−４７６＋２０＋１＝４６件）のスコアと通し番号を取得する。更に、全体ソート部６４は各検索エンジンから取得したスコアを降順に並び替えて、３４件目（５０１番目−４７６件）から２０件のエンジン番号と通し番号と件数を取得することで、目的の書誌一覧を各検索エンジン部６３から取得することができる。
【００３２】
なお、実施の形態３において全体ソート部は検索履歴の分割個数をｎとしたが、検索結果件数に応じてｎを変動させてもよい。例えば検索結果件数が多い場合にはｎを大きくし、少ない場合にはｎを小さくする。またスコア範囲内の平均件数を同じにすることで、分割個数ｎを変動させてもよい。
【００３３】
また、実施の形態３において全体ソート部は検索結果件数と書誌一覧を取得する位置に応じて第１の実施例と第２の実施例の処理を組み合わせてもよい。例えば検索結果件数が１００件程度の少ない件数であれば、スコア分布を取得しないで、第１の実施例の処理方法で行なえばよい。また検索結果件数が多い場合でも先頭からの２０件程度であれば、スコア分布を取得しないで、実施の形態１の処理方法で行なえばよい。
【００３４】
また、実施の形態３において全体ソート部でスコアの最大値をｍとしたが、ｍは検索エンジンから件数を取得すると同時にスコアの最大値を求め、それを用いてもよい。
【００３５】
また、実施の形態１において検索履歴を並べ変える基準として、検索要求と文書間の関係を数値化したスコアを用いたが、日付けなどの数値情報を用いて並べ変えてもよいものとする。この数値情報を用いることは、第２と第３の実施例でも同じように処理できることは言うまでもない。
【００３６】
【発明の効果】
以上のように本発明は、少なくとも文書データの検索と書誌一覧の作成と検索結果をある基準値に従って順序付けをそれぞれ独立して行なう複数の検索エンジン部と、検索を行うための検索情報を格納する索引格納部と、書誌一覧を作成するための情報を格納する書誌格納部と、複数の検索エンジン部の検索結果の全体を順序付けする全体ソート部とを備え、全体ソート部で、検索結果の先頭から所定の順序付けられた基準値までを各検索エンジン部から取得することにより、書誌一覧を取得するために全体ソート部と各検索エンジン部の通信量を減らし、複数検索エンジンの環境でも高速に書誌一覧を取得することができるという効果を有する。
【００３７】
また、全体ソート部は、検索結果全体の半分以降に位置づけられた書誌一覧を取得する場合に、検索エンジン部の検索結果の末尾から順序付けの基準値を取得するようにしたので、全体ソート部において複数の検索エンジン部の検索履歴のスコアを全て抽出するのではなく、検索結果の先頭または末尾からの取得番号と取得件数に応じて部分的にスコアを抽出することにより、高速に書誌一覧の作成することができるという効果を有する。
【００３８】
また、検索結果を順序付けの基準値でｎ分割し、各々分割された基準値の下限値以上の件数を各検索エンジンから取得して、これらの件数を累計することにより、スコアの件数分布を各検索エンジン部から取得して、件数分布から必要となる検索結果の位置を再度計算してスコアを部分的に取得することで、高速に書誌一覧の作成することができるという効果を有する。
【図面の簡単な説明】
【図１】本発明の実施の形態１における情報検索装置の構成図
【図２】実施の形態１における検索エンジン部の動作例を示す図
【図３】実施の形態１における全体ソート部の動作例を示す図
【図４】実施の形態１における書誌一覧作成例を示す図
【図５】本発明の実施の形態２における情報検索装置の構成図
【図６】本発明の実施の形態３における情報検索装置の構成図
【図７】実施の形態３における全体ソート部の動作例を示す図
【図８】従来の情報検索装置の構成図
【符号の説明】
１１−１、１１−２、１１−ｎ　クライアント部
１２　通信部
１３−１、１３−２、１３−ｎ　検索エンジン部
１４　全体ソート部
１５　高スコア記憶部
１６　索引格納部
１７　書誌格納部
５１−１、５１−２、５１−ｎ　クライアント部
５２　通信部
５３−１、５３−２、５３−ｎ　検索エンジン部
５４　全体ソート部
５５　スコア記憶部
５６　索引格納部
５７　書誌格納部
６１−１、６１−２、６１−ｎ　クライアント部
６２　通信部
６３−１、６３−２、６３−ｎ　検索エンジン部
６４　全体ソート部
６５　スコア分布記憶部
６６　スコア記憶部
６７　索引格納部
６８　書誌格納部

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to an information retrieval device that searches digitized document data and can obtain a bibliographic list at high speed even when it is configured with a plurality of search engines.
[0002]
[Prior art]
In recent years, with the spread of word processors and personal computers, large amounts of document information have accumulated, and interest has grown in document databases that retrieve document information as needed. Full-text search, which searches documents by their content without assigning keywords, has also attracted attention and is used for purposes such as searching home pages on the Internet. A search system using full-text search can be used in a client-server form or in a form connected to a WWW server. Rather than letting one user monopolize the system, such a search system displays the list of search results a few dozen entries at a time, so that requests from several users can be processed concurrently. Furthermore, the search result list is not simply output in registration order: the relationship between each document and the search condition is quantified as a score according to some criterion, and the bibliographic list is ranked by the resulting score. This makes it possible to output search results close to what the user wants. Here, "quantification" covers not only uniform scoring but also scoring weighted by the text being searched (for example, weighting a short text more heavily than a long one) or by word. The "bibliographic list" refers to items such as the title or URL (Uniform Resource Locator) of a home page, because a document number alone is hard for the user to understand.
[0003]
Hereinafter, a conventional information retrieval apparatus will be described.
FIG. 8 shows the configuration of a conventional information retrieval device. In FIG. 8, 81-1, 81-2, ..., 81-n are client units, 82 is a communication unit, 83 is a search engine unit, 84 is an index storage unit, and 85 is a bibliographic storage unit.
[0004]
The operation of the information retrieval device configured as described above will be described below. First, each of the client units 81-1, 81-2, ..., 81-n sends a search request from the user to the communication unit 82. The communication unit 82 stores the search requests from the plurality of client units internally and sends them to the search engine unit 83. The search engine unit 83 reads index information from the index storage unit 84, performs a high-speed search, and returns the result to the communication unit 82. The communication unit 82 returns the number of search results to the client units 81-1, 81-2, ..., 81-n. The client units 81-1, 81-2, ..., 81-n then send the communication unit 82 a request for the bibliographic list of the documents hit by the search. The communication unit 82 sends the bibliographic list request to the search engine unit 83. The search engine unit 83 reads the bibliographic information from the bibliographic storage unit 85, creates the bibliographic list, and returns it to the communication unit 82. The communication unit 82 returns the bibliographic list to the client units 81-1, 81-2, ..., 81-n.
[0005]
[Problems to be solved by the invention]
However, in the conventional configuration described above, when the number of stored documents grows to tens of millions, a single computer reaches its limits because of degraded search performance and the constraints of physical computer resources such as hard disk and memory, so the load has to be handled by a plurality of computers. Using a plurality of computers, however, means using a plurality of search engine units, and when an ordered bibliographic list is obtained, the communication load between the search engines becomes large, so that overall performance degrades.
[0006]
An object of the present invention is to solve the above-mentioned problems of the prior art, and to minimize the communication load even when an ordered bibliographic list is obtained with a configuration of a plurality of search engines.
[0007]
[Means for Solving the Problems]
To achieve this object, the information retrieval device of the present invention comprises, first, a plurality of search engine units, each of which independently performs at least retrieval of document data, creation of a bibliographic list, and ordering of the search results according to a certain reference value; an index storage unit that stores search information used for retrieval; a bibliographic storage unit that stores information used to create the bibliographic list; and an overall sort unit that orders the search results of the plurality of search engine units as a whole. The overall sort unit obtains the bibliographic list of the search results by acquiring, from each search engine unit, the reference values from the head of the search results up to a predetermined position in the ordering. With this configuration, a bibliographic list can be created at high speed.
[0008]
Second, when acquiring a bibliographic list positioned in the latter half of the overall search results, the overall sort unit acquires the ordering reference values from the end of each search engine unit's search results. Thus, when a bibliographic list is obtained from a plurality of search engine units, the overall sort unit does not extract all the scores in the search histories of the search engine units. Instead, it extracts scores only partially, from the head or the end of the search results, according to the acquisition start number and the number of records to acquire, so the bibliographic list can be created at high speed.
[0009]
Third, the overall sort unit divides the search results into n ranges (n ≥ 2) by the ordering reference value, first obtains from each search engine the number of records at or above the lower limit of each range, and accumulates the counts for each range, which makes it possible to determine the position of the target bibliographic list. By obtaining the distribution of record counts over score from each search engine unit, recalculating from that distribution the position of the required search results, and then acquiring scores only partially, the bibliographic list can be created at high speed.
[0010]
BEST MODE FOR CARRYING OUT THE INVENTION
(Embodiment 1)
Hereinafter, a first embodiment of the present invention will be described with reference to the drawings.
[0011]
FIG. 1 is a configuration diagram of an information retrieval device according to an embodiment of the present invention. In FIG. 1, 11-1, 11-2, ..., 11-n are client units, 12 is a communication unit, 13-1, 13-2, ..., 13-n are search engine units, 14 is an overall sort unit, 15 is a high score storage unit, 16 is an index storage unit, and 17 is a bibliographic storage unit.
[0012]
The operation of the information retrieval device configured as described above will be described. First, each of the client units 11-1, 11-2, ..., 11-n transfers a search request from a user to the communication unit 12. On receiving a search request, the communication unit 12 requests the number of hits from each search engine unit 13. Each of the search engine units 13-1, 13-2, ..., 13-n reads index information for the search from the index storage unit 16, performs the search, and passes the number of search results to the communication unit 12. The communication unit 12 sums the numbers of search results of the search engine units 13-1, 13-2, ..., 13-n and returns the total to the client unit. When a request for a bibliographic list arrives from the client units 11-1, 11-2, ..., 11-n, the communication unit 12 sends the overall sort unit 14 the start number, counted from the head of the bibliographic list, and the number of records to acquire. The overall sort unit 14 requests from each of the search engine units 13-1, 13-2, ..., 13-n ({acquisition start number} + {number of records to acquire} − 1) of the values (scores) held in the search history, which stores the search result information; each score is computed by a certain criterion between the search request and a document. The search engine units 13-1, 13-2, ..., 13-n sort by score and return the requested number of scores to the overall sort unit 14. The overall sort unit 14 rearranges the scores obtained from the search engine units 13-1, 13-2, ..., 13-n in score order and determines a start number and a number of records to acquire for each search engine unit. The overall sort unit 14 sends the start number and number of records to each of the search engine units 13-1, 13-2, ..., 13-n, and each search engine unit reads the document numbers from its search history beginning at the start number, creates the bibliographic contents for those document numbers from the bibliographic storage unit 17, and sends them to the overall sort unit 14. The overall sort unit 14 creates the bibliographic list by arranging the bibliographic contents obtained from the search engine units 13-1, 13-2, ..., 13-n according to the score-ordered information, and returns it to the communication unit 12. The communication unit 12 transfers the bibliographic list to the client units 11-1, 11-2, ..., 11-n, and the process ends.
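The score exchange of this embodiment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `top_scores` and `plan_requests`, the in-memory dictionaries standing in for the search engine units, and the tie-breaking rule are all hypothetical.

```python
from typing import Dict, List, Tuple


def top_scores(history: List[float], k: int) -> List[float]:
    """Engine side: the k highest scores of one search history, descending."""
    return sorted(history, reverse=True)[:k]


def plan_requests(
    histories: Dict[str, List[float]], start: int, count: int
) -> Dict[str, Tuple[int, int]]:
    """Overall sort side: engine -> (start number, number of records)."""
    k = start + count - 1  # scores requested from every engine
    merged = []
    for engine, history in histories.items():
        for rank, score in enumerate(top_scores(history, k), start=1):
            merged.append((score, engine, rank))
    # Descending by score; ties are broken by engine name and local rank
    # (an assumption, the text does not say how ties are handled).
    merged.sort(key=lambda item: (-item[0], item[1], item[2]))
    window = merged[start - 1 : start - 1 + count]
    plan: Dict[str, Tuple[int, int]] = {}
    for _score, engine, rank in window:
        first, n = plan.get(engine, (rank, 0))
        plan[engine] = (min(first, rank), n + 1)
    return plan


histories = {
    "e1": [90, 50, 40],
    "e2": [95, 85, 80, 10],
    "e3": [99, 88, 70, 60],
}
print(plan_requests(histories, start=1, count=5))
```

Because every engine returns its own top ({acquisition start number} + {number of records to acquire} − 1) scores, the records that fall in the requested window are guaranteed to be among the scores received, and each engine's share of the window is one contiguous run of its local ranking.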
[0013]
FIG. 2 shows an example of the search history, that is, the search result information stored in the search engine units 13-1, 13-2, ..., 13-n. Here the search histories of three search engines are shown. 21 is the search history of the first search engine unit, 22 is the search history of the second search engine unit, and 23 is the search history of the third search engine unit. Each history is shown sorted in descending order of score. When a bibliographic list with an acquisition start number of 1 and 10 records is to be obtained from these search histories, the overall sort unit 14 sends each search engine unit 13 a request for 1 + 10 − 1 = 10 scores, and 24, 25, and 26 show the top ten scores extracted by the search engine units 13-1, 13-2, ..., 13-n. 24 shows the ten scores of the first search engine unit, 25 shows the ten scores of the second search engine unit, and 26 shows the ten scores of the third search engine unit. As this example shows, each search engine can determine its top scores.
[0014]
FIG. 3 is a diagram showing an example in which the scores of FIG. 2 are sorted by the overall sort unit 14.
In this figure, ten search history entries are obtained from each of the three search engines, and the 30 entries in total are rearranged in score order. In this example, since the acquisition start number is 1 and the number of records is 10, the first to tenth entries form the bibliographic list returned to the client units 11-1, 11-2, ..., 11-n. FIG. 3 also shows the start number and the number of records determined from this search history for each of the search engine units 13-1, 13-2, ..., 13-n. In this example, the first search engine unit is given start number 1 and 2 records, the second search engine unit start number 1 and 4 records, and the third search engine unit start number 1 and 4 records. As this example shows, the overall sort unit 14 sorts the scores and can determine the start number and the number of records to request from each of the search engine units 13-1, 13-2, ..., 13-n.
[0015]
FIG. 4 is a diagram showing a process of acquiring a bibliography from the start number and the number of acquisitions of each search engine unit in FIG. 3 and creating a bibliography list.
[0016]
Each of the search engine units 13-1, 13-2, ..., 13-n determines the document numbers from its search history, reads the bibliographic contents corresponding to those document numbers from the bibliographic storage unit 17, and transfers the bibliographies it has obtained to the overall sort unit 14. The overall sort unit 14 sorts the bibliographies in score order, creates the bibliographic list, and returns the bibliographic list to the communication unit 12.
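The final assembly step can be sketched as follows, assuming the overall sort unit has kept the score-ordered sequence of (engine, local rank) pairs from the earlier score exchange. The function name `assemble_list` and the shape of the `fetched` mapping are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def assemble_list(
    order: List[Tuple[str, int]], fetched: Dict[str, Dict[int, str]]
) -> List[str]:
    """Place each engine's bibliographic record at its global position.

    order:   (engine, local rank) pairs already sorted by score
    fetched: engine -> {local rank: bibliographic record}
    """
    return [fetched[engine][rank] for engine, rank in order]


order = [("e3", 1), ("e2", 1), ("e1", 1), ("e3", 2)]
fetched = {
    "e1": {1: "Title A"},
    "e2": {1: "Title B"},
    "e3": {1: "Title C", 2: "Title D"},
}
print(assemble_list(order, fetched))
```

No scores need to travel with the bibliographic records in this sketch, because the ordering was already fixed when the scores were merged.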
[0017]
As shown in the above example, a bibliographic list can be created by creating bibliographic contents from the start number of each search engine and the number of acquisitions, and rearranging them again in the overall sort unit.
[0018]
As described above, according to the present embodiment, in an information retrieval device composed of a plurality of search engine units, when a target bibliographic list is obtained from search results ordered by score or the like, providing an overall sort unit and a high score storage unit allows the bibliographic list to be created at high speed by acquiring only the necessary part of the search history.
[0019]
In the first embodiment, the client unit, the communication unit, the search engine unit, and the overall sort unit may all run on one computer, or may each run on a separate computer. Some of them may also share one computer.
[0020]
Also, in the first embodiment, the communication unit may hold the number of search results of each search engine unit and pass it to the overall sort unit, so that the overall sort unit makes no bibliography acquisition request to a search engine unit whose number of search results is zero. This reduces the time spent communicating with search engine units that have zero hits.
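A minimal sketch of this filtering step, assuming the hit counts are available as a mapping from engine name to count (the name `engines_to_query` and the mapping are hypothetical):

```python
from typing import Dict, List


def engines_to_query(hit_counts: Dict[str, int]) -> List[str]:
    """Keep only the engines that reported at least one hit."""
    return [engine for engine, hits in hit_counts.items() if hits > 0]


print(engines_to_query({"e1": 12, "e2": 0, "e3": 3}))
```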
[0021]
(Embodiment 2)
Hereinafter, a second embodiment of the present invention will be described with reference to the drawings.
[0022]
FIG. 5 is a diagram showing an information retrieval device according to one embodiment of the present invention.
In FIG. 5, 51-1, 51-2, ..., 51-n are client units, 52 is a communication unit, 53-1, 53-2, ..., 53-n are search engine units, 56 is an index storage unit, and 57 is a bibliographic storage unit; these are the same as in the configuration of FIG. 1. The difference from the configuration of FIG. 1 lies in the overall sort unit 54 and the score storage unit 55: when score information is acquired from the search history, the scores are not always acquired and stored in descending order. Instead, whether to acquire them in descending or in ascending order of score is selected automatically according to the position of the acquisition start number.
[0023]
For example, when newspaper articles are not arranged in chronological order and a new article is to be acquired, retrieval may be more efficiently performed by acquiring from the end than acquiring from the beginning.
[0024]
The operation of the information retrieval device configured as described above will be described below.
First, the client units 51-1, 51-2, ..., 51-n transfer a search request from the user to the communication unit 52. On receiving a search request, the communication unit 52 requests the number of hits from each of the search engine units 53-1, 53-2, ..., 53-n. Each of the search engine units 53-1, 53-2, ..., 53-n reads index information for the search from the index storage unit 56, performs the search, and passes the number of search results to the communication unit 52. The communication unit 52 sums the numbers of search results of the search engine units 53-1, 53-2, ..., 53-n and returns the total to the client unit 51. When a request for a bibliographic list arrives, the communication unit 52 sends the acquisition start number and the number of records to acquire to the overall sort unit 54. When the acquisition start number is larger than the total number of search results divided by 2, the overall sort unit 54 requests each of the search engine units 53-1, 53-2, ..., 53-n to acquire ({total number of search results} − {acquisition start number} + 1) records at position ({total number of search results} − {acquisition start number} − {number of records to acquire} + 2) counted from the end of the search history. The search engine units 53-1, 53-2, ..., 53-n sort by score and return the requested number of scores, taken from the head or the end of the sorted result, to the overall sort unit 54. The overall sort unit 54 rearranges the scores obtained from the search engine units 53-1, 53-2, ..., 53-n in descending order when they were acquired from the head and in ascending order when they were acquired from the end, and determines the start number and the number of records to acquire for each search engine unit. The overall sort unit 54 sends the start number and number of records to each of the search engine units 53-1, 53-2, ..., 53-n, and each search engine unit reads the document numbers from its search history beginning at the start number, creates the bibliographic contents for those document numbers from the bibliographic storage unit 57, and sends them to the overall sort unit 54. The overall sort unit 54 creates the bibliographic list by arranging the bibliographic contents obtained from the search engine units 53-1, 53-2, ..., 53-n according to the score-ordered information, and returns it to the communication unit 52. The communication unit 52 transfers the bibliographic list to the client units 51-1, 51-2, ..., 51-n, and the process ends.
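The choice between head and end, and the two expressions given in this paragraph, can be sketched as below. This is an illustration under assumptions: the function name `score_request` and the returned tuple layout are hypothetical, and the sketch assumes the requested window lies entirely within the search results.

```python
from typing import Tuple


def score_request(total: int, start: int, count: int) -> Tuple[str, int, int]:
    """Return (side, first position on that side, scores to request).

    total: total number of search results over all engines
    start: acquisition start number, counted from the head (1-based)
    count: number of records to acquire
    """
    if start > total / 2:
        # Latter half: count positions from the end of the search history.
        return ("end", total - start - count + 2, total - start + 1)
    # First half: same request as in the first embodiment.
    return ("head", start, start + count - 1)


print(score_request(total=1000, start=981, count=10))
print(score_request(total=1000, start=11, count=10))
```

With 1000 results, a window starting at the 981st record needs only 20 scores per engine when counted from the end, whereas counting from the head would need 990.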
[0025]
As described above, because the overall sort unit acquires the scores of the search history from the head or from the end according to the position of the bibliographic list to be acquired, fewer scores are transferred to the overall sort unit and the overall sort unit 54 has fewer records to sort, so the bibliographic list can be obtained faster.
[0026]
Although the scores are acquired from the end of the search history in the second embodiment, the search engine unit may instead change its sort from descending to ascending order so that the scores are acquired from the head.
[0027]
(Embodiment 3)
Hereinafter, a third embodiment of the present invention will be described with reference to the drawings.
[0028]
FIG. 6 is a diagram showing an information retrieval device according to one embodiment of the present invention.
In FIG. 6, 61-1, 61-2, ..., 61-n are client units, 62 is a communication unit, 63-1, 63-2, ..., 63-n are search engine units, 67 is an index storage unit, and 68 is a bibliographic storage unit; these are the same as in the configuration of FIG. 1. The difference from the configuration of FIG. 1 lies in the overall sort unit 64, the score storage unit 65, and the score distribution storage unit 66. When a bibliographic list is acquired, the overall sort unit 64 obtains distribution information of scores and record counts from the search engine units 63-1, 63-2, ..., 63-n and totals (accumulates) the counts within each score range, which reduces the number of scores that must be acquired from the search engine units 63-1, 63-2, ..., 63-n.
[0029]
The operation of the information retrieval device configured as described above will be described below.
First, the client units 61-1, 61-2, ..., 61-n transfer a search request from the user to the communication unit 62. On receiving a search request, the communication unit 62 requests the number of hits from each of the search engine units 63-1, 63-2, ..., 63-n. Each of the search engine units 63-1, 63-2, ..., 63-n reads index information for the search from the index storage unit 67, performs the search, and passes the number of search results to the communication unit 62. The communication unit 62 sums the numbers of search results of the search engine units 63-1, 63-2, ..., 63-n and returns the total to the client unit 61.
[0030]
When a request for a bibliographic list arrives, the communication unit 62 sends the overall sort unit 64 the start number, counted from the head of the bibliographic list, and the number of records to acquire. Taking m as the maximum score, the overall sort unit 64 divides the scores from 0 to m into n score ranges and requests from the search engine units 63-1, 63-2, ..., 63-n the number of records at or above the lower limit of each score range. The search engine units 63-1, 63-2, ..., 63-n determine, from the scores in their search histories, the number of records at or above the lower limit of each score range and send these counts to the overall sort unit 64. The overall sort unit stores the score distributions obtained from the search engines in the score distribution storage unit 66 and creates the overall score distribution. This shows which score range contains the acquisition start number of the bibliographic list, so the overall sort unit again asks the search engine units 63-1, 63-2, ..., 63-n to acquire, among the records with a score of s or less, the scores and serial numbers of ({acquisition start number} − {number of records with a value at or above the lower-limit score of the next higher score range} + {number of records to acquire} − 1) records and to send them to the overall sort unit 64. The overall sort unit 64 rearranges the scores obtained from the search engine units 63-1, 63-2, ..., 63-n in score order and determines the start number and the number of records to acquire for each search engine unit. The overall sort unit 64 sends the start number and number of records to each of the search engine units 63-1, 63-2, ..., 63-n, and each search engine unit reads the document numbers from its search history beginning at the start number, creates the bibliographic contents for those document numbers from the bibliographic storage unit 68, and sends them to the overall sort unit 64. The overall sort unit 64 creates the bibliographic list by arranging the bibliographic contents obtained from the search engine units 63-1, 63-2, ..., 63-n according to the score-ordered information, and returns it to the communication unit 62. The communication unit 62 transfers the bibliographic list to the client units 61-1, 61-2, ..., 61-n, and the process ends.
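The two-round exchange of this embodiment can be sketched as follows. It is a minimal illustration rather than the patented implementation: all function names are hypothetical, the engines are plain in-memory lists, and the boundary is treated as strict (a record counted as "at or above" a lower limit is excluded from the second request), which is an assumption the text does not spell out.

```python
from typing import Dict, List, Optional, Tuple


def cumulative_counts(scores: List[float], lower_bounds: List[float]) -> List[int]:
    """Engine side: for each lower bound, records scoring at or above it."""
    return [sum(1 for s in scores if s >= bound) for bound in lower_bounds]


def locate(
    start: int, count: int, totals: List[int], lower_bounds: List[float]
) -> Tuple[Optional[float], int, int]:
    """Return (upper limit, records at or above it, scores needed per engine).

    lower_bounds must be in descending order; totals[i] is the count,
    summed over all engines, of records scoring >= lower_bounds[i].
    """
    above, upper = 0, None
    for bound, total in zip(lower_bounds, totals):
        if total >= start:
            break  # the start position lies in this range
        above, upper = total, bound
    return upper, above, start - above + count - 1


def fetch_window(
    engines: Dict[str, List[float]],
    lower_bounds: List[float],
    start: int,
    count: int,
) -> List[float]:
    """Both rounds: gather the distribution, then only the needed scores."""
    per_engine = [cumulative_counts(s, lower_bounds) for s in engines.values()]
    totals = [sum(column) for column in zip(*per_engine)]
    upper, above, needed = locate(start, count, totals, lower_bounds)
    merged: List[float] = []
    for scores in engines.values():
        below = [s for s in sorted(scores, reverse=True)
                 if upper is None or s < upper]
        merged.extend(below[:needed])
    merged.sort(reverse=True)
    offset = start - above - 1  # zero-based position inside the range
    return merged[offset : offset + count]


engines = {
    "e1": [950, 820, 700, 650, 300],
    "e2": [900, 810, 790, 500, 100],
    "e3": [880, 640, 620, 610, 50],
}
print(fetch_window(engines, [800, 600, 400, 200, 0], start=7, count=3))
```

In this small data set five records score 800 or more, so a window starting at the seventh record needs only four scores per engine from below 800 instead of nine from the head.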
[0031]
FIG. 7 is a diagram showing an example in which score distributions are created by the search engine units. Reference numerals 71, 72, and 73 denote the distributions obtained by the three search engine units, each of which counts, from the scores of its search history, the number of results at or above the lower limit of each score range. Each score distribution is sent to the overall sort unit 64, which sums the counts of the three search engine units 63 within each score range, as shown at 74. In the example of this figure, when the acquisition start number is 501 and the number of records to acquire is 20, 476 results have a score of 800 or more, so the score of the 501st result is 801 or less. The overall sort unit 64 therefore acquires from each search engine unit 46 scores (501 − 476 + 20 + 1 = 46) and serial numbers for results with a score of 801 or less. The overall sort unit 64 then sorts the scores obtained from the search engine units in descending order and takes 20 entries (engine number, serial number, and count) starting from the 25th (501 − 476 = 25), so that the target bibliographic list can be obtained from the search engine units 63.
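The arithmetic of this example can be checked with a short Python sketch. Only the acquisition start number 501, the number of records 20, and the summed count 476 at the lower limit 800 come from the text; the per-engine split, the other score ranges, and all variable names are hypothetical.

```python
# Cumulative counts per engine: results with score >= each lower limit.
# Hypothetical values chosen so that the column for 800 sums to 476.
lower_limits = [800, 600, 400, 200, 0]
engine_counts = [
    [150, 420, 700, 880, 1000],  # distribution 71 (hypothetical)
    [176, 390, 650, 800, 900],   # distribution 72 (hypothetical)
    [150, 400, 640, 790, 950],   # distribution 73 (hypothetical)
]

# Overall distribution 74: sum within each score range.
total = [sum(col) for col in zip(*engine_counts)]

start, count = 501, 20

# First range whose cumulative count reaches the acquisition start number.
idx = next(i for i, c in enumerate(total) if c >= start)
above = total[idx - 1] if idx > 0 else 0

# Position of the 501st result among the results below the lower limit 800.
offset = start - above

# Smallest per-engine window that still covers positions offset..offset+count-1.
minimum_window = offset + count - 1

print(total, above, offset, minimum_window)
```

With these numbers the wanted entries occupy positions 25 through 44 of the merged partial list, so any per-engine request of at least 44 scores covers them; the 46 scores requested in the example are slightly more than this minimum.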
[0032]
In the third embodiment, the overall sort unit divides the search history into a fixed number n of score ranges. However, n may be changed according to the number of search results. For example, n is increased when the number of search results is large and decreased when it is small. The number of divisions n may also be chosen so that the average number of records per score range stays the same.
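One way to keep the average number of records per score range roughly constant is to derive n from the number of search results. The following Python sketch assumes a hypothetical target of 100 records per range and a minimum of two ranges; neither value nor the function name comes from the text.

```python
import math


def choose_divisions(hit_count, target_per_range=100, min_n=2):
    """Pick n so that hit_count / n stays close to target_per_range.

    target_per_range and min_n are illustrative assumptions.
    """
    return max(min_n, math.ceil(hit_count / target_per_range))


for hits in (50, 1000, 2850):
    print(hits, choose_divisions(hits))
```

A larger result set then yields more, narrower score ranges, so the partial score request made after locating the acquisition start number stays small.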
[0033]
Further, in the third embodiment, the overall sort unit may switch to the processing of the first or second embodiment depending on the number of search results and the position of the bibliographic list to be acquired. For example, when the number of search results is as small as about 100, the processing method of the first embodiment may be used without acquiring the score distribution. Even when the number of search results is large, the processing method of the first embodiment may be used without acquiring the score distribution if the bibliographic list to be acquired lies within about the first 20 results.
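Such a combination amounts to a small dispatch rule. The Python sketch below is illustrative only: the function name, the returned labels, and the exact comparisons are assumptions, the thresholds 100 and 20 mirror the "about 100" and "about 20" of the text, and the branch for positions near the end of the results is an assumed counterpart based on the second embodiment.

```python
def choose_method(hit_count, start, count, small_hits=100, near_edge=20):
    """Decide which embodiment's procedure to use for one request.

    Returns "head" (first embodiment), "tail" (second embodiment),
    or "distribution" (third embodiment).
    """
    last = start + count - 1
    if hit_count <= small_hits:
        return "head"          # few results: skip the score distribution
    if last <= near_edge:
        return "head"          # page lies within the first few results
    if hit_count - start + 1 <= near_edge:
        return "tail"          # assumed: page lies within the last few results
    return "distribution"


print(choose_method(80, 41, 20))
print(choose_method(100000, 1, 20))
print(choose_method(100000, 99990, 20))
print(choose_method(100000, 501, 20))
```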
[0034]
In the third embodiment, the overall sort unit takes m as the maximum value of the score. However, the maximum score may instead be obtained from the search engine units at the same time as the counts are obtained, and that value may be used as m.
[0035]
In the first embodiment, a score obtained by quantifying the relationship between a search request and a document is used as the criterion for rearranging the search history. However, the search history may instead be rearranged using other numerical information such as a date. Needless to say, such numerical information can be processed in the same manner in the second and third embodiments.
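As a small illustration of ordering by a date instead of a relevance score, the following Python sketch merges per-engine lists that are already sorted newest first. The record layout, document names, and the helper `merge_by_reference` are hypothetical; `heapq.merge` with `key` and `reverse` is standard-library behavior.

```python
import heapq
from datetime import date


def merge_by_reference(per_engine, key):
    """Merge per-engine lists, each already sorted by key in descending order."""
    return list(heapq.merge(*per_engine, key=key, reverse=True))


# Hypothetical (document, date) records held by two search engine units.
engine_a = [("doc-a1", date(1998, 3, 17)), ("doc-a2", date(1997, 1, 5))]
engine_b = [("doc-b1", date(1998, 1, 1)), ("doc-b2", date(1996, 6, 30))]

# The date is turned into a number and used as the ordering reference value.
merged = merge_by_reference([engine_a, engine_b], key=lambda r: r[1].toordinal())
print([name for name, _ in merged])
```

Any value that can be compared numerically can take the place of the score in the same way.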
[0036]
[Effect of the Invention]
As described above, the present invention comprises a plurality of search engine units that each independently perform at least the search of document data, the creation of a bibliographic list, and the ordering of search results according to a reference value; an index storage unit that stores search information for performing the search; a bibliographic storage unit that stores information for creating the bibliographic list; and an overall sort unit that orders the search results of the plurality of search engine units as a whole. The overall sort unit acquires from each search engine unit the ordering reference values from the beginning of the search results up to a predetermined position. This reduces the volume of communication between the overall sort unit and the search engine units needed to obtain a bibliographic list, so that a bibliographic list can be obtained at high speed even in an environment with multiple search engines.
[0037]
Also, when obtaining a bibliographic list positioned after the halfway point of the entire search results, the overall sort unit acquires the ordering reference values from the end of the search results of each search engine unit. Instead of extracting all the scores in the search histories of the plurality of search engine units, only part of the scores is extracted from the beginning or the end of the search results according to the acquisition start number and the number of records to be acquired, so that the bibliographic list can be created at high speed.
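Acquisition from the end of the search results can be sketched in Python as follows. The function name, the list-based search histories, and the sample scores are hypothetical; the sketch assumes each history is sorted by descending score and that the total number of results is known from the per-engine result counts.

```python
def fetch_from_tail(histories, start, count):
    """Return (score, engine_no, serial) for global ranks start..start+count-1,
    reading only the last part of each engine's search history."""
    total = sum(len(h) for h in histories)
    if start > total:
        return []
    # Results ranked start..total: this many scores are needed from the end.
    k = total - start + 1
    merged = []
    for eng_no, h in enumerate(histories):
        first = max(0, len(h) - k)  # each engine sends at most its last k scores
        merged += [(h[i], eng_no, i) for i in range(first, len(h))]
    # Descending score; ties broken by engine number, then serial number.
    merged.sort(key=lambda t: (-t[0], t[1], t[2]))
    tail = merged[-k:]              # the k lowest-ranked results overall
    return tail[:count]


# Hypothetical search histories of three engines, best score first.
histories = [
    [950, 900, 820, 640, 610, 300, 120],
    [990, 810, 790, 650, 420, 410, 50],
    [870, 860, 700, 690, 200, 100, 10],
]
print(fetch_from_tail(histories, start=18, count=3))
```

For a request near the end of a large result set, k is small, so far fewer scores are transferred than when counting from the beginning.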
[0038]
In addition, the search results are divided into n ranges by the ordering reference value, the number of results at or above the lower limit of each range is acquired from each search engine unit, and these counts are accumulated. The distribution of the number of scores is thus obtained from the search engine units, the position of the required search results is calculated from this distribution, and only the corresponding part of the scores is acquired, so that the bibliographic list can be created at high speed.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of an information retrieval device according to a first embodiment of the present invention.
FIG. 2 is a diagram illustrating an operation example of a search engine unit according to the first embodiment.
FIG. 3 is a diagram showing an example.
FIG. 4 is a diagram showing an example of creating a bibliographic list in the first embodiment.
FIG. 5 is a configuration diagram of an information retrieval device in the second embodiment of the present invention.
FIG. 7 is a diagram showing an operation example of an overall sort unit according to the third embodiment.
FIG. 8 is a diagram showing the configuration of a conventional information retrieval device.
11-1, 11-2, 11-n Client unit
12 Communication unit
13-1, 13-2, 13-n Search engine unit
14 Overall sort unit
15 High score storage unit
16 Index storage unit
17 Bibliographic storage unit
51-1, 51-2, 51-n Client unit
52 Communication unit
53-1, 53-2, 53-n Search engine unit
54 Overall sort unit
55 Score storage unit
56 Index storage unit
57 Bibliographic storage unit
61-1, 61-2, 61-n Client unit
62 Communication unit
63-1, 63-2, 63-n Search engine unit
64 Overall sort unit
65 Score distribution storage unit
66 Score storage unit
67 Index storage unit
68 Bibliographic storage unit

Claims

An information retrieval device comprising: a plurality of search engine units that each independently perform at least search of document data, creation of a bibliographic list, and ordering of search results according to scores; an index storage unit that stores search information for performing a search; a bibliographic storage unit that stores information for creating a bibliographic list; and an overall sort unit that orders the search results of the plurality of search engine units as a whole,
wherein the overall sort unit acquires from each of the search engine units the scores from the beginning of the search results up to the number determined from the acquisition start number and the number of records to be acquired, merges the acquired scores to determine the position and number of bibliographic entries to be obtained from each of the search engine units, requests bibliographic contents from the search engine units according to the determined position and number, and creates a bibliographic list.

The information retrieval device according to claim 1, further comprising: a client unit that processes a search request; and a communication unit that transfers the search request from the client unit to the search engine units and the overall sort unit and returns a search result to the client unit.

The information retrieval device according to claim 1, wherein, when acquiring a bibliographic list positioned after the halfway point of the entire search results, the overall sort unit acquires from each of the search engine units scores counted from the end of the search results of that search engine unit, the number of scores being determined from the number of search results and the acquisition start number, uses the acquired scores to calculate the position and number of bibliographic entries to be obtained from each of the search engine units, requests bibliographic contents from the search engine units according to the calculated position and number, and creates a bibliographic list.

The information retrieval device according to claim 1, wherein the overall sort unit divides the scores of the search results into n ranges (n ≥ 2), acquires from each search engine unit the number of results for each divided score range and stores the counts in a score distribution storage unit, accumulates the stored counts, acquires the scores of the corresponding portion from each search engine unit and stores them in a score storage unit, merges the stored scores to calculate the position and number of bibliographic entries to be obtained from each of the search engine units, requests bibliographic contents from the search engine units according to the calculated position and number, and creates a bibliographic list.