JP2004295797A

JP2004295797A - Information retrieval device

Info

Publication number: JP2004295797A
Application number: JP2003090394A
Authority: JP
Inventors: Hiroyuki Onuma; 宏行大沼; Yoshitaka Hamaguchi; 佳孝濱口
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2003-03-28
Filing date: 2003-03-28
Publication date: 2004-10-21

Abstract

PROBLEM TO BE SOLVED: To considerably reduce the possibility of providing incorrect information or missing correct information in an information retrieval device that provides the information a user requests by retrieving documents containing a specific keyword from a specific document set and then retrieving, from the retrieved document set, documents containing other keywords related to that keyword.

SOLUTION: The information retrieval device retrieves a first document containing a specific search word from a set of documents to be searched, extracts from the first document a first word that satisfies prescribed extraction conditions, retrieves from the same set a second document containing the extracted first word, and extracts from the second document a second word that satisfies the prescribed extraction conditions. The device is provided with a ranking means 52 that ranks the second words by assigning each an importance according to the weight of the second document. For instance, when the second document is the same document as the first document, the weight of the second document is made large.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
この発明は情報検索装置、具体的には、特定の文書集合から特定のキーワードを含む文書を検索し、更に検索された文書の集合から該特定のキーワードに関連する他のキーワードを含む文書を検索することによりユーザが求める情報を提供する情報検索装置に関するものである。
【０００２】
【従来の技術】
情報検索装置には、検索入力である特定のキーワードに基づいて検索した検索結果（文書集合）から、その文書集合中に出現する単語の統計情報（文書重み、出現位置、単語長、単語種別、文字列一致状況、ＴＦ／ＩＤＦなどの各種パラメータ）を計算し、計算結果に基づいて該特定のキーワードに関連する関連キーワードを抽出し、抽出された関連キーワードを検索入力として再度検索を行うことにより、必要な情報を得るものがある（例えば特許文献１参照）。
【０００３】
この装置を利用すれば、例えばユーザがあるテーマに関心を持つ場合、該テーマに関連した人の名前及びその所属組織を簡単に知ることができる。以下にこのような情報検索装置の使用方法について、ユーザが「燃料電池」の開発等に携わる人の名前及びその所属組織を検索する場合を例に取り以下に説明する。尚、ここでは「山田太郎」は燃料電池に関わる人物であって「○○大学」に所属し、「佐藤花子」も燃料電池に関わる人物であって、「××大学」に所属するものとし、検索範囲はｄｏｃ１〜ｄｏｃ５であると仮定する。
【０００４】
方法１：先ず、「燃料電池」をキーワード（検索ワード）として検索を行う。そしてヒットした文書から、単語種別が人名である単語をチェックし、「燃料電池」に関連した人名を関連キーワードとして抽出する。次に、この関連キーワード（人名）を新たな検索ワードとして再度検索を行う。そしてヒットした文書から、単語種別が組織名である単語をチェックしその人が所属する組織の名称を抽出する。
以下に図２を参照して上記方法１についてより具体的に説明する。「燃料電池」をキーワードとした最初の検索で、集合Ａの文書（ｄｏｃ１，ｄｏｃ２）が検索され、各文書から人名（山田太郎，佐藤花子）が関連キーワードとして抽出される。次に、関連キーワード「山田太郎」で再び検索を行うと、集合Ｂの文書（ｄｏｃ１，ｄｏｃ３，ｄｏｃ５）が検索され、各文書から組織名（○○大学，△△大学）が抽出される。
【０００５】
方法２：別の方法として、２回目の検索を行わず、最初に検索された文書だけを対象として、同じ文書に名前及び組織名が共に出現していたときに、その人物がその組織に所属していると判断することも考えられる。即ち、図２の文書ｄｏｃ１において、「山田太郎」と「○○大学」が共に出現しているので、「燃料電池」に関わる人物「山田太郎」の所属組織は「○○大学」であると推定できる。
【０００６】
【特許文献１】
特開平１１−２５１０８号公報
【０００７】
【発明が解決しようとする課題】
しかしながら、方法１では上に説明したように、２回目の検索では「○○大学」と「△△大学」が検索され、「山田太郎」が「○○大学」に所属すると推定することはできない。方法１は、１回目の検索の結果、文書集合Ａが得られたという事実を利用しておらず、同じ文書集合に対して異なるキーワードで検索を２回行うだけである。そのため、ｄｏｃ１に含まれる「山田太郎」とｄｏｃ５に含まれる「山田太郎」とが同姓同名の別人であっても、それぞれの所属組織を抽出してしまう可能性が高い。
【０００８】
一方、方法２では、１回目の検索しか行わないため、所属組織を取りこぼす可能性がある。例えば、ｄｏｃ４は２回目の検索を行わないため検索できず、「佐藤花子」の所属組織「××大学」は抽出できない。
【０００９】
本発明は上記問題に鑑みなされたものであり、特定の文書集合から特定のキーワードを含む文書を検索し、更に検索された文書の集合から該特定のキーワードに関連する他のキーワードを含む文書を検索することによりユーザが求める情報を提供する情報検索装置において、従来に比べ、誤った情報を提供する可能性及び正しい情報を取りこぼす可能性を大幅に低減することを目的とする。
【００１０】
【課題を解決するための手段】
上記目的を達成すべく、検索対象文書集合から特定の検索ワードを含む第１の文書を検索し、該第１の文書から所定の抽出条件に合致する第１の単語を抽出し、前記検索対象文書集合から該抽出された第１の単語を含む第２の文書を検索し、該第２の文書から所定の抽出条件に合致する第２の単語を抽出する手段を有する本発明の情報検索装置は、
前記第２の文書に、前記第１の文書の検索結果に応じた重みを付与する重み付け手段と、
前記第２の文書の重みに応じて前記第２の単語に重要度を付与して順位付けする第１のランキング手段と
を備えることを特徴とする。
【００１１】
【発明の実施の形態】
第１の実施形態
図１にこの発明の第１の実施形態に係る情報検索装置の構成を示す。この装置は、入出力部１、検索条件作成部２、文書情報記憶部３、文書検索部４、キーワード自動抽出部５とを含む。
【００１２】
キーワード自動抽出部５は出現単語呼出部５１、単語ランキング部５２から構成される。入出力部１はユーザからの検索要求を受け付け、情報検索の結果を出力するものである。例えば、「燃料電池」に関連する人名とその所属組織を検索要求として受け付け、それらの情報（人名と組織名）を出力する。検索条件作成部２は、ユーザからの検索要求の内容に応じ、検索ワードを決定するとともに、検索要求にマッチした情報が出力されるように単語抽出の際の文書の重み付けの方法を決定する。例えば、検索要求が「燃料電池」に関連する人の人名と所属組織であれば、「燃料電池」に関連する人の人名を抽出するプロセスと、抽出された人の所属組織を抽出するプロセスとに分け、各プロセスごとに検索ワードを決定する。
【００１３】
文書情報記憶部３は、文書検索部４で文書を検索するために必要なインデックス情報や、図２（ａ）に示したような検索対象文書に含まれる単語の単語情報（単語名、単語種別など）を格納する。単語種別には、人名、組織名、役職、場所名などがある。文書検索部４は、検索条件作成部２が決定した検索ワードで、検索範囲の文書を検索し、検索された文書のＩＤを出力する。出現単語呼出部５１は、文書検索部４が出力するＩＤを有する文書のそれぞれについて、文書情報記憶部３を参照して、文書に含まれる単語のそれぞれについて単語情報を呼び出す。単語ランキング部５２は、出現単語呼出部５１が呼び出した単語情報を統計処理し、検索条件作成部２が決定した重み付け方法に従って単語に重要度を付与し、順位付けする。尚、後述するように、本実施形態では重み付けの際には、文書検索部の検索結果の情報も利用する。
【００１４】
以下に、図３のフローチャートを参照して上記情報検索装置の動作を説明する。ここでは、入出力部１が、「燃料電池」に関連する人名とその所属組織の名称を検索要求として受け付けたものとする。
【００１５】
先ず、検索条件作成部２は、検索要求に従い、ステップ１００において、検索手順（文書の検索条件と、キーワードをランキングするための重み付けの方法）を決定する。ここでは、検索手順を以下に示すＰｒｏｃｅｓｓ．１とＰｒｏｃｅｓｓ．２のそれぞれについて決定する。
Ｐｒｏｃｅｓｓ．１：特定の文書集合の中から「燃料電池」を含む文書を検索し、検索された文書に含まれる人名を抽出する。このときデフォルトの重み付けにより出現頻度の高い人名の順位を高くする。
Ｐｒｏｃｅｓｓ．２：Ｐｒｏｃｅｓｓ．１で抽出された人名の中、順位の高いものについてその人名を含む文書を上記特定の文書集合の中から検索する。そして検索された文書に含まれる組織を、重要度を付与して、即ち順位を付けて出力する。このとき、Ｐｒｏｃｅｓｓ．１でも検索された文書の重みを大きくし、Ｐｒｏｃｅｓｓ．１で検索された文書にも現れる組織の順位が高くなるようにする。
【００１６】
これらの検索手順は、Ｐｒｏｃｅｓｓ．１とＰｒｏｃｅｓｓ．２とに分けて不図示の検索状況一時記憶部に記憶される。図４（ａ）に検索状況一時記憶部の初期登録時の状態を示す。検索状況一時記憶部では、処理順序、検索ワード、出力情報、重みづけ方法、結果文書リスト、ランキング結果（抽出された重要度の大きい単語）の項目がある。検索ワードの項目には検索ワードが格納される。但しＰｒｏｃｅｓｓ．２の欄では、まだ検索ワードが決まっていないのでＰｒｏｃｅｓｓ．１のランキング結果を検索ワードにするという情報を格納する。
【００１７】
出力情報の項目には出力すべき単語（抽出すべき単語）の単語種別を格納する。重みづけ方法の項目にはどのように重み付けを行うかを示す情報を格納する。結果文書リストの項目には文書検索部４で検索された文書のリストを格納する。ランキング結果項目には単語ランキング部５２のランキング結果を格納する。
【００１８】
ステップ１００の時点では、まだ検索処理を実行していないので結果文書リストの項目とランキング結果の項目は空になっている。ステップ１００で検索手順を決定した後、処理対象（実行位置）を検索状況一時記憶部のＰｒｏｃｅｓｓ．１の先頭に設定する（ステップ１１０）。次に、検索条件作成部２は検索状況一時記憶部から検索ワードの項目を取り出す。取り出した検索ワードの項目が単語であるか否かを調べ（ステップ１２０）、単語であればステップ１４０へ進み、「Ｐｒｏｃｅｓｓ．ｍのランキング結果」であれば、ステップ１３０へ進む（ｍは１以上の整数であり、本実施形態ではｍ＝１）。Ｐｒｏｃｅｓｓ．１の場合は、検索ワードの項目は単語「燃料電池」であるのでステップ１４０へ進む。Ｐｒｏｃｅｓｓ．２の場合は、検索ワードの項目は「Ｐｒｏｃｅｓｓ．ｍのランキング結果」であるのでステップ１３０へ進む。
【００１９】
詳細は後述するが、ステップ１３０に進んだ場合は、検索条件作成部２は、Ｐｒｏｃｅｓｓ．１の処理で抽出された１つまたは複数の単語をそれぞれ検索ワードとして検索を行うプロセスを検索状況一時記憶部に追加し、この追加したプロセスの先頭に処理対象を設定してからステップ１４０へ進む。ステップ１４０では、文書検索部４は検索ワードで文書検索を行い、検索結果を検索状況一時記憶部の結果文書リスト項目に格納する。図２（ａ）の文書集合の場合には、「燃料電池」で検索するとｄｏｃ１とｄｏｃ２がヒットするので、図４（ｂ）に示すようにＰｒｏｃｅｓｓ．１の結果文書リストの項目にはｄｏｃ１とｄｏｃ２が登録される。
【００２０】
次に、出現単語呼出部５１は、Ｐｒｏｃｅｓｓ．１の結果文書リストの項目に登録されている文書ｄｏｃ１とｄｏｃ２に含まれる単語の単語情報を文書情報記憶部３から呼び出す（ステップ１５０）。図２に示す文書集合の場合、単語種別が人名である「山田太郎」、「佐藤花子」が抽出される。次に、単語ランキング部５２は、抽出された単語に対し、設定された重みづけ方法に従って、重要度を付与し、ランキングする（ステップ１６０）。ランキングは以下に示す条件１及び条件２に従って行われる。
【００２１】
条件１：重みづけ方法の項目の内容が「Ｐｒｏｃｅｓｓ．ｍの結果文書リスト」ならば、Ｐｒｏｃｅｓｓ．ｍの検索文書リストにある文書に出現する単語の重要度を大きくする。単語ｔが文書Ｄ１〜Ｄｎにそれぞれ出現するとき、本実施形態では、
【数１】

で計算する。ここでｄｏｃ＿ａｐｐｅａｒ（Ｄｋ，ｍ）は、文書ＤｋがＰｒｏｃｅｓｓ．ｍの結果文書リストに含まれる文書であれば「１」の値を取り、そうでなければ「０」の値を取る。また、ｗ_０，ｗ_１は定数である。
条件２：重みづけ方法の項目の内容が「デフォルト」ならば、出現数が多い単語ほど重要度を大きくする。単語ｔが文書Ｄ１〜Ｄｎに出現するとき、本実施形態では
単語ｔの重み＝ｎとする
【００２２】
上記の条件１又は２に従い、各単語について重要度を付与する。Ｐｒｏｃｅｓｓ．１の処理では、条件２が適用され、ｄｏｃ１に含まれる人名「山田太郎」とｄｏｃ２に含まれる「佐藤花子」の出現数はそれぞれ１回なので、
単語「山田太郎」の重要度＝１．０
単語「佐藤花子」の重要度＝１．０
となる。その後、次のプロセスが存在するか否かを調べ（ステップ１７０）、存在する場合には検索条件作成部２は、次のプロセスの先頭を処理対象に設定し、ステップ１２０へ戻る。次のプロセスが存在しなければ、入出力部１は検索結果を出力する（ステップ１８０）。
【００２３】
以下、Ｐｒｏｃｅｓｓ．２の処理を図４を参照して説明する。
ステップ１２０では、検索ワードの項目の内容は「Ｐｒｏｃｅｓｓ．１のランキング結果」であるのでステップ１３０へ進む。ステップ１３０において、「山田太郎」、「佐藤花子」をそれぞれ検索ワードとする検索を行うために、処理順序の項目にＰｒｏｃｅｓｓ．２−１とＰｒｏｃｅｓｓ．２−２とを追加する。その結果、検索状況一時記憶部の内容は図４（ｃ）のようになる。ここで、追加したＰｒｏｃｅｓｓ．２−１の先頭を処理対象に設定してステップ１４０へ進む。
【００２４】
ステップ１４０では、文書検索部４は「山田太郎」を検索ワードにして文書検索を行う。例えば、図２の文書集合の場合、「山田太郎」で検索すると、ｄｏｃ１、ｄｏｃ３、ｄｏｃ５がヒットし、図４（ｄ）に示すようにＰｒｏｃｅｓｓ．２−１の結果文書リストの項目にはこれらの文書が登録される。ステップ１５０では、ｄｏｃ１、ｄｏｃ３，ｄｏｃ５に出現する組織名「○○大学」と「△△大学」が抽出される。
【００２５】
Ｐｒｏｃｅｓｓ．２−１の重みづけ方法の項目の内容が「Ｐｒｏｃｅｓｓ．１の結果文書リスト」であるので、ステップ１６０では、Ｐｒｏｃｅｓｓ．１の結果文書にリストの項目に登録されているｄｏｃ１に含まれる組織名が優先されるような重み付けを行う。ここでｗ_０＝１．０、ｗ_１＝１０．０とすると、

となり、Ｐｒｏｃｅｓｓ．１でヒットした文書に含まれる組織名「○○大学」に高い重要度が付与される。
【００２６】
Ｐｒｏｃｅｓｓ．２−２の処理も上記と同様である。この場合、「佐藤花子」を含む文書はｄｏｃ２、ｄｏｃ４であり、組織名としてｄｏｃ４から「××大学」が抽出される。ｗ_０＝１．０，ｗ_１＝１０．０とすると、単語「××大学」の重要度＝１．０＋１０．０ × ０＝１．０となり、重要度は低いが所属組織として抽出することができる。
【００２７】
図４（ｅ）に最終的な処理結果を示す。即ち、“「燃料電池」に関連する人の名前とその所属組織”という検索要求に対し、
”山田太郎” ○○大学
”佐藤花子” ××大学
がユーザに提供される。
【００２８】
以上説明したように、第１の実施形態によれば、あるテーマに関連する人の名前とその所属組織を検索する際に、テーマを表す単語を検索ワードとした検索でヒットした文書が、人名を表す単語を検索ワードとした別の検索でもヒットした場合に、その文書の重みを大きく設定することにより、テーマを表す単語、人名を表す単語、及び組織を表す単語が同じ文書に揃って出現している場合に、優先的にその文書に記載の組織を表す単語を所属組織として抽出することができる。更に、テーマを表す単語を含んでいない文書からも組織名を抽出し、その人の所属組織を発見することができる。これによって、情報の取りこぼしを減らし、高精度の情報を得ることが可能になる。
【００２９】
第２の実施形態
第１の実施形態では、人名は姓と名が揃っている場合には効果的に所属組織を検索できる。しかし、人名は常に姓と名が揃って文書中に出現するとは限らない。例えば、「山田教授」のように姓のみが役職と共に出現し、名が文書中に現れないことがある。以下に説明する第２の実施形態によれば人名から所属組織を検索する際に、人名が姓だけで表される文書からも所属組織を見つけることができる。第２の実施形態の情報検索装置の構成は、図１に示した実施の形態１と同様であるが、図５に示すように、文書情報記憶部３に各単語の出現位置を示す情報も格納する点で第１の実施形態と異なる。尚、出現位置は、文書の先頭からの文字数である。
【００３０】
第２の実施形態の装置の動作を図６のフローチャートとを参照して以下に説明する。ここでは、入出力部１が「燃料電池」に関連する人の名前とその所属組織を検索要求として受け付けたものとする。
【００３１】
ステップ２００では第１の実施形態のステップ１００と同様の処理を行う。但し、第２の実施形態ではＰｒｏｃｅｓｓ．２に代えてＰｒｏｃｅｓｓ．２ａをＰｒｏｃｅｓｓ．１の後に実行するように設定する。
Ｐｒｏｃｅｓｓ．２ａではＰｒｏｃｅｓｓ．１で抽出された「姓」の中、重要度の大きい、即ち順位の高いものについてその「姓」を含む文書を検索し、検索された文書に出現する単語の中、単語種別が組織であるものを抽出する。Ｐｒｏｃｅｓｓ．２ａではまた、Ｐｒｏｃｅｓｓ．１で抽出された「姓」の中、その近傍（例えば出現位置が１０文字以内）に「役職」を表す単語が存在するものに大きい重要度を付与する。図７（ａ）にこれらの手順を実行した後の検索状況一時記憶部の状態を示す。
【００３２】
ステップ２１０では第１の実施形態のステップ１１０と同様の処理を行いステップ２２０へ進む。ステップ２２０では、第１の実施形態のステップ１２０と同様、検索ワードの項目が単語であるか否かを判断する。Ｐｒｏｃｅｓｓ．１では図７（ａ）に示すように検索ワードの項目は、単語「燃料電池」であるのでステップ２４０へ進む。Ｐｒｏｃｅｓｓ．２ａの処理では検索ワードの項目は、「Ｐｒｏｃｅｓｓ．１のランキングの「姓」」であるのでステップ２３０へ進み、ステップ２３０において第１の実施形態のステップ１３０と同様の処理を行ってステップ２４０へ進む。
【００３３】
ステップ２４０では第１の実施形態のステップ１４０と同様の処理を行う。例えば、図５の文書集合において、「燃料電池」で検索した場合には、ｄｏｃ１がヒットするので、Ｐｒｏｃｅｓｓ．１の結果文書リスト項目にｄｏｃ１を登録する（図７（ｂ））。続いてステップ２５０で第１の実施形態のステップ１５０と同様の処理を行い、ステップ２６０へ進む。ステップ２６０では、ｄｏｃ１を検索し、その中に含まれる「人名」または「姓」、及びそれらに付随する「役職」を抽出する。単語ランキング部５２は、設定された重みづけ方法に従って抽出された単語をランキングする。ランキングは次の条件に従って行う。
【００３４】
条件１：重みづけ方法の項目の内容が「Ｐｒｏｃｅｓｓ．ｍのランキングの役職」ならば、検索ワードと一致した「姓」の近傍（例えば、出現位置が１０文字以内）に、Ｐｒｏｃｅｓｓ．ｍのランキングの役職が存在するかどうかをヒットした各文書についてチェックする。存在すれば、その文書の重みを大きくする。別の役職が存在すれば、その文書の重みを低くする。
本実施形態では、検索ワード（姓）と同じ単語ｔが文書Ｄ１〜Ｄｎに出現した場合、
【数２】

で計算する。
ｄｏｃ＿ａｐｐｅａｒ（Ｄｋ，ｍ）は、文書Ｄｋ中に、検索ワード（姓）と一致する単語の近傍に、ランキング結果の「役職」が存在するならばｗ１、別の役職が存在するならば−ｗ２、役職が存在しないなら０の値を取る。具体例は後述する。
条件２：第１の実施形態のステップ１６０の条件２と同様のランキング処理を行う。
【００３５】
ランキング処理の終了後、ステップ２７０へ進み、第１の実施形態のステップ１７０と同様の処理を行い、続いてステップ２８０で第１の実施形態のステップ１８０と同様の処理を行う。
【００３６】
Ｐｒｏｃｅｓｓ．１において、ステップ２６０までの処理が終了すると、Ｐｒｏｃｅｓｓ．２ａの先頭を処理対象に設定し、ステップ２１０へ戻る。以下、Ｐｒｏｃｅｓｓ．２ａの処理を図６参照して説明する。
ステップ２２０では、Ｐｒｏｃｅｓｓ．２の検索ワード項目は、Ｐｒｏｃｅｓｓ．１のランキング結果であると判定し、ステップ２３０へ進む。ステップ２３０では、”山田太郎”の姓の「山田」を検索ワードに設定するための処理を、検索状況一時記憶部に追加する。その結果、検索状況一時記憶部の状態は図７（ｃ）のようになる。そして追加されたＰｒｏｃｅｓｓ．２ａ−１の先頭を処理対象に設定してからステップ２４０へ進む。
【００３７】
ステップ２４０において、文書検索部４は「山田」を検索ワードにして検索を行う。例えば図５の文書集合の場合、ｄｏｃ１、ｄｏｃ２、ｄｏｃ３がヒットするので図７（ｄ）に示すように、Ｐｒｏｃｅｓｓ．２ａ−１の結果文書リストの項目にｄｏｃ１、ｄｏｃ２、ｄｏｃ３を登録する。ステップ２５０では、図５に示す文書情報記憶部の内容に基づき、ｄｏｃ２，ｄｏｃ３に含まれる組織名「○○大学」、「△△大学」を抽出する。
【００３８】
Ｐｒｏｃｅｓｓ．２ａ−１の重みづけ方法の項目が「Ｐｒｏｃｅｓｓ．１のランキングの「役職」」なので、ここではｄｏｃ１に記載された役職「教授」が「山田」の近傍に出現する文書が優先され、大きな重みが付与される。
例えば、ｗ_０＝１．０，ｗ_１＝１０．０，ｗ_２＝１０．０とすると、
単語「○○大学」の重要度＝ｗ_０＋ｄｏｃ＿ａｐｐｅａｒ（ｄｏｃ２，１）＝（１．０＋１０．０）＝１１．０
単語「△△大学」の重要度＝ｗ_０＋ｄｏｃ＿ａｐｐｅａｒ（ｄｏｃ３，１）＝（１．０＋（−１０．０））＝ −９．０
となる。このように、「山田」と「教授」が近い位置にある文書ｄｏｃ２に含まれる組織名「○○大学」が優先されて大きな重要度が付与され、ｄｏｃ３は「山田」と「助教授」が近くにある文書のため、小さな重要度が付与される。
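The scoring in this worked example can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the use of `None` for "no job title nearby", and the English title strings are assumptions. The constants and the resulting values 11.0 and -9.0 are those given in the text.

```python
W0, W1, W2 = 1.0, 10.0, 10.0  # constants from the worked example

def doc_appear(nearby_title, ranked_title):
    """Return w1 if the ranked job title is near the surname, -w2 for a
    different title, and 0 if no title is nearby (None is an assumed marker)."""
    if nearby_title is None:
        return 0.0
    return W1 if nearby_title == ranked_title else -W2

def importance(nearby_titles, ranked_title):
    """Sum w0 + doc_appear over the documents containing the organization name.
    `nearby_titles` holds, per document, the title found near the surname."""
    return sum(W0 + doc_appear(title, ranked_title) for title in nearby_titles)

# doc2: "Professor" near "Yamada"; doc3: "Associate Professor" near "Yamada".
score_doc2_org = importance(["Professor"], "Professor")
score_doc3_org = importance(["Associate Professor"], "Professor")
```

With these inputs the organization found in doc2 scores 11.0 and the one found in doc3 scores -9.0, matching the values above.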
【００３９】
図７（ｄ）に、最終的な処理結果を示す。結果として、「燃料電池」に関連する人の名前とその所属組織は、“山田太郎”（○○大学）となる。
以上説明したように、第２の実施形態によれば、例えば「山田教授」のように「姓」と「役職」が共に出現する場合には「名」が書かれないことを考慮し、人名から所属組織を検索する際には、姓だけで検索を行い、先立って行われた検索で抽出された「役職」（あるいは「名」）が抽出された姓の近傍にあるかどうかをチェックすることで、高精度に所属組織を抽出できる。
【００４０】
第３の実施形態
第１、第２の実施形態では検索対象文書の種類を特に限定していないが、第３の実施形態ではＷｅｂサイトに公開され、ＵＲＬ（ＵｎｉｖｅｒｓａｌＲｅｓｏｕｒｃｅＬｏｃａｔｏｒ）で位置を特定できる文書に検索対象を限定し、文書の重み付けを行う際にそのＵＲＬを利用する。
【００４１】
例えば、図２（１）のｄｏｃ１のＵＲＬが「ｈｔｔｐ：／／ｗｗｗ．ａａａ．ａｃ．ｊｐ」から始まり、ｄｏｃ３のＵＲＬが、「ｈｔｔｐ：／／ｗｗｗ．ｂｂｂ．ａｃ．ｊｐ」から始まる場合には、この２つの文書に出現する「山田太郎」は別人である可能性が高い。なぜなら、２つの文書のｗｅｂ上の位置が大きく異なるからである。
反対に、例えば、ｄｏｃ２とｄｏｃ４のＵＲＬが「ｈｔｔｐ：／／ｗｗｗ．ａａａ．ａｃ．ｊｐ／ｌａｂ／ｘｘｘｘ」まで一致するならば、２つの文書に出現する「佐藤花子」は同一人物である可能性が高い。そこで第３の実施形態では、検索された文書のＵＲＬの位置関係を利用して重み付けを行う。
【００４２】
図８に第３の実施形態の情報検索装置の構成を示す。同図に示すように、第３の実施形態は、文書位置類似度判定部６が追加されている点で第１の実施形態と構成が異なる。
文書位置類似度判定部６は、２つのＵＲＬを比較入力として受け取り、２つのＵＲＬの位置関係を示す情報を類似度として出力する機能を有する。また、文書情報記憶部３は各文書のＵＲＬ情報も格納している。ここで、図２に示した各文書が次のＵＲＬを持つものとする。
【００４３】
ｄｏｃ１：ｈｔｔｐ：／／ｗｗｗ．ａａａ．ａｃ．ｊｐ／ｘｘ／ｙｙ／ｐｅｒｓｏｎｌ．ｈｔｍｌ
ｄｏｃ２：ｈｔｔｐ：／／ｗｗｗ．ａａａ．ａｃ．ｊｐ／ｌａｂ／ｘｘｘｘ／ａｌ．ｈｔｍｌ
ｄｏｃ３：ｈｔｔｐ：／／ｗｗｗ．ｂｂｂ．ａｃ．ｊｐ／ｚｚ／ｗｗ／ｉｎｄｅｘ．ｈｔｍｌ
ｄｏｃ４：ｈｔｔｐ：／／ｗｗｗ．ａａａ．ａｃ．ｊｐ／ｌａｂ／ｘｘｘｘ／ｂｌ．ｈｔｍ
ｄｏｃ５：ｈｔｔｐ：／／ｗｗｗ．ａａａ．ａｃ．ｊｐ／ｚｚ／ｗｗ／ｐｐ／ｉｎｄｅｘ．ｈｔｍｌ
【００４４】
次に、図９を参照して第３の実施形態の動作を説明する。第３の実施形態の動作は、ステップ１００からステップ１５０、ステップ１７０、及びステップ１８０については第１の実施形態と同じであり、ステップ１６０に代えて以下に説明するステップ３００を実行する点で第１の実施形態と異なる。
ステップ３００において、単語ランキング部５２は下記の条件に従って文書を重み付けし、単語をランキングする。
【００４５】
条件２は、第１の実施形態のステップ１６０の条件２と同様であるので説明は省略し、条件１について説明する。
条件１：重みづけ方法の項目の内容が「Ｐｒｏｃｅｓｓ．ｍの結果文書リスト」ならば、再度の検索で抽出された単語ｔが出現した文書のＵＲＬとＰｒｏｃｅｓｓ．ｍの検索文書リストにある各文書のＵＲＬとを比較し、比較結果に基づいて単語ｔに重要度を付与する。本実施形態では単語ｔの重要度を下記の式に従って計算する。
【数３】

【００４６】
関数ｕｒｌ＿ｄｉｓｔａｎｃｅの値は、文書位置類似度判定部６が例えば以下に示す手順で計算する。
ｈｔｔｐ：／／ｗｗｗ．ａａａ．ａｃ．ｊｐとｈｔｔｐ：／／ｗｗｗ．ｂｂｂ．ａｃ．ｊｐのように、ドメイン名が異なる場合（前者はａａａ．ａｃ．ｊｐ，後者はｂｂｂ．ａｃ．ｊｐ）には０とする。
ｈｔｔｐ：／／ｗｗｗ．ｘｘ．ａａａ．ａｃ．ｊｐとｈｔｔｐ：／／ｗｗｗ．ｙｙ．ａａａ．ａｃ．ｊｐのように、サブドメイン名が異なる場合（前者はｘｘ．ａａａ．ａｃ．ｊｐ，後者はｙｙ．ａａａ．ａｃ．ｊｐ）には０．１とする。
ｈｔｔｐ：／／ｗｗｗ．ｙｙ．ａａａ．ａｃ．ｊｐ／ｉｎｆｏ／１．ｈｔとｈｔｔｐ：／／ｗｗｗ．ｙｙ．ａａａ．ａｃ．ｊｐ／ｏｒｇ／２．ｈｔｍｌのように、ドメイン名のみが一致している場合は０．２とする。
【００４７】
さらに、ディレクトリの一致状況に応じて次のように点数付けを行う。
「ｈｔｔｐ：／／ｗｗｗ．ａａａ．ａｃ．ｊｐ／ｌｅｖｅｌ−１／」までなら、０．５
「ｈｔｔｐ：／／ｗｗｗ．ａａａ．ａｃ．ｊｐ／ｌｅｖｅｌ−１／ｌｅｖｅｌ−２」までなら、０．７５（１／２＋１／４）
「ｈｔｔｐ：／／ｗｗｗ．ａａａ．ａｃ．ｊｐ／ｌｅｖｅｌ−１／ｌｅｖｅｌ−２／ｌｅｖｅｌ−３」までなら、０．８７５（１／２＋１／４＋１／８）
…
【数４】

とする。例えば、ｄｏｃ１とｄｏｃ５では、ｈｔｔｐ：／／ｗｗｗ．ａａａ．ａｃ．ｊｐまでが一致するので０．２となる。
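As a rough sketch, the scoring rules listed above can be written in Python as follows. The function name `url_distance` comes from the text; everything else is an assumption, in particular treating the last three host labels (such as `aaa.ac.jp`) as the domain name, treating the final path segment as a file name, and using 1 - 2^(-d) for d matching directory levels, which reproduces the listed values 0.5, 0.75, and 0.875.

```python
from urllib.parse import urlsplit

def url_distance(url_a, url_b, domain_labels=3):
    """Score how close two URLs are, following the rules listed in the text."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    host_a, host_b = a.hostname.split("."), b.hostname.split(".")
    # Different domain names (assumed: the last `domain_labels` host labels).
    if host_a[-domain_labels:] != host_b[-domain_labels:]:
        return 0.0
    # Same domain name but different subdomain.
    if host_a != host_b:
        return 0.1
    # Directory components only; the last path segment is assumed to be a file.
    dirs_a = [p for p in a.path.split("/")[:-1] if p]
    dirs_b = [p for p in b.path.split("/")[:-1] if p]
    depth = 0
    for x, y in zip(dirs_a, dirs_b):
        if x != y:
            break
        depth += 1
    if depth == 0:
        return 0.2  # only the host matches
    return 1.0 - 0.5 ** depth  # 1/2 + 1/4 + ... over the matching levels

doc1 = "http://www.aaa.ac.jp/xx/yy/personl.html"
doc2 = "http://www.aaa.ac.jp/lab/xxxx/al.html"
doc3 = "http://www.bbb.ac.jp/zz/ww/index.html"
doc4 = "http://www.aaa.ac.jp/lab/xxxx/bl.htm"
doc5 = "http://www.aaa.ac.jp/zz/ww/pp/index.html"
```

For the URLs listed earlier, this gives 0.2 for doc1 and doc5 (as stated in the text), 0.0 for doc1 and doc3, and 0.75 for doc2 and doc4 under the assumptions above.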
【００４８】
以下、「山田太郎」の所属組織の計算結果について述べる。抽出された組織名は、
「○○大学」と「△△大学」である。ｗ_０＝１．０，ｗ_１＝１０．０で計算すると、

【００４９】
次に、「佐藤花子」の所属組織の計算結果について述べる。抽出された組織名は「××大学」である。同様に、ｗ_０＝１．０，ｗ_１＝１０．０で計算すると、

となる。第１の実施形態では重みが低かったが、第３の実施形態では、ｄｏｃ２とｄｏｃ４のＵＲＬが類似しているために重みが高くなっている。
【００５０】
第３の実施形態では、ランキングを行う際にＵＲＬの位置関係を利用し、同じサイトにある場合など、位置が近い文書に高い重みを設定することにより高精度に重み付けを行うことができる。
【００５１】
以上説明した第１〜第３の実施形態は、ユーザからの検索要求が、あるテーマに関連する人の名前と所属組織の抽出である場合についてのものであるが本発明はそれに限定されるものではない。例えば、図４で、Ｐｒｏｃｅｓｓ．１の出力情報を組織名とし、Ｐｒｏｃｅｓｓ．２の出力情報を人名とすることにより、あるテーマに関連した組織名とその組織に所属する人の名を抽出することもできる。
文書検索部４はキーワード検索を行うものに限られない。ユーザからの検索入力を、「燃料電池」のような単語に代えて文書とし、文書検索部４で類似文書検索を行うようにしてもよい。
第３の実施形態で説明したＵＲＬ距離の計算方法は、一例であり、ＵＲＬの比較結果に基づき、文書間の距離を求める任意の方法を用いることができる。
【００５２】
第１、２、３の実施形態における重みの計算方法を、特開平１１−２５１０８に記載の重み計算方法と組み合わせることも可能である。
第２の実施形態では、Ｐｒｏｃｅｓｓ．２の重み付けでは「役職」を用いたが、「名」を用いてもよい。また、第１の実施形態と第２の実施形態とを組み合わせてもよい。
各実施形態では、検索を検索ワードを代えて連続して行っているが、最初の人名抽出結果だけをユーザに出力し、ユーザが指定した人名についてだけ所属組織を検索するようにしてもよい。
【００５３】
【発明の効果】
本発明によれば、特定の文書集合から特定のキーワードを含む文書を検索し、更に検索された文書の集合から該特定のキーワードに関連する他のキーワードを含む文書を検索することによりユーザが求める情報を提供する情報検索装置において、従来に比べ、誤った情報を提供する可能性及び正しい情報を取りこぼす可能性が大幅に低減される。
【図面の簡単な説明】
【図１】本発明の実施の形態１に係る情報検索装置の構成を示すブロック図である。
【図２】実施の形態１の情報検索装置の文書情報記憶部の構造を示す図である。
【図３】実施の形態１の情報検索装置の動作を説明するフローチャートである。
【図４】実施の形態１の情報検索装置の検索状況一時記憶部の状態の変化を説明する図である。
【図５】実施の形態１の情報検索装置の文書情報記憶部の構造を示す図である。
【図６】本発明の実施の形態２に係る情報検索装置の動作を説明するフローチャートである。
【図７】実施の形態２の情報検索装置の検索状況一時記憶部の状態の変化を説明する図である。
【図８】本発明の実施の形態３に係る情報検索装置の構成を示すブロック図である。
【図９】実施の形態３の情報検索装置の動作を説明するフローチャートである。
【符号の説明】
１入出力部、２検索条件作成部、３文書情報記憶部、４文書検索部、５キーワード自動抽出部、６文書位置類似度判定部、５１出現単語呼び出し部、５２単語ランキング部。
[0001]
[Technical field of the invention]
The present invention relates to an information retrieval device, specifically one that provides the information a user requests by retrieving documents containing a specific keyword from a specific document set and then retrieving, from the set of retrieved documents, documents containing other keywords related to that keyword.
[0002]
[Prior art]
Some information retrieval devices take the search result (a document set) obtained for a specific keyword given as search input, compute statistical information on the words appearing in that document set (various parameters such as document weight, appearance position, word length, word type, string-match status, and TF/IDF), extract related keywords associated with the specific keyword on the basis of the computed statistics, and search again using the extracted related keywords as search input to obtain the required information (see, for example, Patent Document 1).
[0003]
With such a device, a user who is interested in a certain theme can easily find the names of people related to that theme and the organizations to which they belong. How such a device is used is described below, taking as an example a user searching for the names and affiliations of people involved in "fuel cell" development. Here, "Taro Yamada" is assumed to be a person involved in fuel cells who belongs to "○○ University", "Hanako Sato" is likewise a person involved in fuel cells who belongs to "×× University", and the search range is assumed to be doc1 to doc5.
[0004]
Method 1: First, a search is performed using "fuel cell" as the keyword (search word). The words whose word type is personal name are then checked in the hit documents, and the personal names related to "fuel cell" are extracted as related keywords. Next, a second search is performed using each related keyword (personal name) as a new search word. The words whose word type is organization name are then checked in the hit documents, and the name of the organization to which the person belongs is extracted.
Method 1 is described more concretely below with reference to FIG. 2. The first search, with "fuel cell" as the keyword, retrieves the documents of set A (doc1, doc2), and the personal names (Taro Yamada, Hanako Sato) are extracted from these documents as related keywords. Searching again with the related keyword "Taro Yamada" then retrieves the documents of set B (doc1, doc3, doc5), and the organization names (○○ University, △△ University) are extracted from these documents.
[0005]
Method 2: Alternatively, the second search may be omitted. Only the documents retrieved by the first search are examined, and when a personal name and an organization name appear together in the same document, the person is judged to belong to that organization. That is, because "Taro Yamada" and "○○ University" both appear in document doc1 of FIG. 2, the organization to which "Taro Yamada", a person involved in "fuel cells", belongs can be inferred to be "○○ University".
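The difference between the two conventional methods can be seen on a small example. The Python sketch below is an illustration only: FIG. 2 is not reproduced in this text, so the contents of the toy corpus (in particular which organization appears in doc3 and doc5) are assumptions, as are all function names.

```python
# Toy corpus standing in for FIG. 2 (contents are assumed, not from the source).
# Each document maps to a list of (word, word_type) pairs.
CORPUS = {
    "doc1": [("fuel cell", "term"), ("Taro Yamada", "person"), ("○○ University", "organization")],
    "doc2": [("fuel cell", "term"), ("Hanako Sato", "person")],
    "doc3": [("Taro Yamada", "person"), ("△△ University", "organization")],
    "doc4": [("Hanako Sato", "person"), ("×× University", "organization")],
    "doc5": [("Taro Yamada", "person"), ("△△ University", "organization")],
}

def search(word):
    """Return the IDs of documents that contain the given word."""
    return [doc_id for doc_id, words in CORPUS.items() if any(w == word for w, _ in words)]

def extract(doc_ids, word_type):
    """Return the distinct words of one type found in the given documents, in order."""
    found = []
    for doc_id in doc_ids:
        for word, kind in CORPUS[doc_id]:
            if kind == word_type and word not in found:
                found.append(word)
    return found

def method1(theme):
    """Search twice over the whole corpus; the first result set is not reused."""
    people = extract(search(theme), "person")
    return {p: extract(search(p), "organization") for p in people}

def method2(theme):
    """Search once and pair each person with organizations in the same document."""
    result = {}
    for doc_id in search(theme):
        for person in extract([doc_id], "person"):
            result.setdefault(person, []).extend(extract([doc_id], "organization"))
    return result
```

Under these assumptions, `method1("fuel cell")` returns both ○○ University and △△ University for Taro Yamada with nothing to tell them apart, while `method2("fuel cell")` returns no organization at all for Hanako Sato.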
[0006]
[Patent Document 1]
JP-A-11-25108
[0007]
[Problems to be solved by the invention]
However, as explained above, with Method 1 the second search yields both "○○ University" and "△△ University", so it cannot be inferred that "Taro Yamada" belongs to "○○ University". Method 1 makes no use of the fact that the first search produced document set A; it merely searches the same document set twice with different keywords. Consequently, even if the "Taro Yamada" in doc1 and the "Taro Yamada" in doc5 are different people who happen to share the same full name, the affiliations of both are likely to be extracted.
[0008]
With Method 2, on the other hand, only the first search is performed, so an affiliation may be missed. For example, doc4 is never retrieved because no second search is performed, and "×× University", the organization to which "Hanako Sato" belongs, cannot be extracted.
[0009]
The present invention has been made in view of the above problems. Its object is to reduce significantly, compared with the prior art, the possibility of providing incorrect information and the possibility of missing correct information in an information retrieval device that provides the information a user requests by retrieving documents containing a specific keyword from a specific document set and then retrieving, from the set of retrieved documents, documents containing other keywords related to that keyword.
[0010]
[Means for Solving the Problems]
To achieve the above object, the information retrieval device of the present invention, which has means for retrieving a first document containing a specific search word from a set of documents to be searched, extracting from the first document a first word that satisfies a predetermined extraction condition, retrieving from the set of documents to be searched a second document containing the extracted first word, and extracting from the second document a second word that satisfies a predetermined extraction condition, is characterized by comprising:
weighting means for assigning to the second document a weight according to the search result for the first document; and
first ranking means for ranking the second words by assigning each an importance according to the weight of the second document.
[0011]
[Embodiments of the invention]
First embodiment
FIG. 1 shows the configuration of the information retrieval device according to the first embodiment of the present invention. The device includes an input/output unit 1, a search condition creation unit 2, a document information storage unit 3, a document search unit 4, and a keyword automatic extraction unit 5.
[0012]
The keyword automatic extraction unit 5 consists of an appearing word calling unit 51 and a word ranking unit 52. The input/output unit 1 accepts a search request from the user and outputs the result of the information search. For example, it accepts a request for the names of people related to "fuel cell" and their affiliated organizations, and outputs that information (personal names and organization names). The search condition creation unit 2 determines the search word according to the content of the user's search request, and also determines how documents are to be weighted during word extraction so that information matching the search request is output. For example, if the request is for the names and affiliated organizations of people related to "fuel cell", the unit divides the work into a process that extracts the names of people related to "fuel cell" and a process that extracts the organizations to which the extracted people belong, and determines a search word for each process.
[0013]
The document information storage unit 3 stores the index information that the document search unit 4 needs in order to search for documents, together with word information (word name, word type, and so on) for the words contained in the documents to be searched, as shown in FIG. 2(a). Word types include personal name, organization name, job title, and place name. The document search unit 4 searches the documents in the search range using the search word determined by the search condition creation unit 2, and outputs the IDs of the retrieved documents. For each document whose ID is output by the document search unit 4, the appearing word calling unit 51 refers to the document information storage unit 3 and retrieves the word information for each word contained in the document. The word ranking unit 52 performs statistical processing on the word information retrieved by the appearing word calling unit 51, assigns an importance to each word according to the weighting method determined by the search condition creation unit 2, and ranks the words. As described later, in this embodiment the search results of the document search unit 4 are also used in the weighting.
[0014]
The operation of the information retrieval device is described below with reference to the flowchart of FIG. 3. Here, the input/output unit 1 is assumed to have accepted, as the search request, the names of people related to "fuel cell" and the names of the organizations to which they belong.
[0015]
First, in step 100, the search condition creation unit 2 determines the search procedure (the document search conditions and the weighting method used to rank keywords) in accordance with the search request. Here, the search procedure is determined for each of Process.1 and Process.2 below.
Process.1: Documents containing "fuel cell" are retrieved from a specific document set, and the personal names contained in the retrieved documents are extracted. The default weighting is used, so that personal names appearing more frequently are ranked higher.
Process.2: For the highly ranked personal names extracted in Process.1, documents containing each name are retrieved from the same document set. The organizations contained in the retrieved documents are then output with an importance assigned, that is, in ranked order. Here, documents that were also retrieved in Process.1 are given a larger weight, so that organizations appearing in documents retrieved in Process.1 are ranked higher.
[0016]
These search procedures are stored, separately for Process.1 and Process.2, in a search status temporary storage unit (not shown). FIG. 4(a) shows the state of the search status temporary storage unit at initial registration. The unit holds the following items: processing order, search word, output information, weighting method, result document list, and ranking result (the extracted words of high importance). The search word item stores the search word. In the Process.2 row, however, the search word has not yet been determined, so the item instead stores information indicating that the ranking result of Process.1 is to be used as the search word.
[0017]
The output information item stores the word type of the words to be output (the words to be extracted). The weighting method item stores information indicating how weighting is to be performed. The result document list item stores the list of documents retrieved by the document search unit 4. The ranking result item stores the ranking result produced by the word ranking unit 52.
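One way to picture the items of the search status temporary storage unit is as a small record type. The Python sketch below is only an illustration: the class name, field names, and the string encodings used for references such as "ranking result of Process.1" are assumptions, since FIG. 4 is not reproduced in this text.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessEntry:
    """One row of the search status temporary storage unit (names are illustrative)."""
    order: str         # processing order, e.g. "Process.1"
    search_word: str   # a literal word, or a reference such as "ranking:Process.1"
    output_type: str   # word type to extract, e.g. "person"
    weighting: str     # "default", or a reference such as "result_docs:Process.1"
    result_docs: list = field(default_factory=list)  # filled by the document search
    ranking: list = field(default_factory=list)      # filled by the word ranking

# Initial registration (step 100): result_docs and ranking start empty.
table = [
    ProcessEntry("Process.1", "fuel cell", "person", "default"),
    ProcessEntry("Process.2", "ranking:Process.1", "organization", "result_docs:Process.1"),
]
```

Using `default_factory` gives each row its own empty lists, so results recorded for one process do not leak into another.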
[0018]
At step 100 no search has yet been executed, so the result document list and ranking result items are empty. After the search procedure is determined in step 100, the processing target (execution position) is set to the head of Process.1 in the search status temporary storage unit (step 110). Next, the search condition creation unit 2 reads the search word item from the search status temporary storage unit and checks whether it is a word (step 120). If it is a word, processing proceeds to step 140; if it is "ranking result of Process.m", processing proceeds to step 130 (m is an integer of 1 or more; in this embodiment m = 1). For Process.1 the search word item is the word "fuel cell", so processing proceeds to step 140. For Process.2 the search word item is "ranking result of Process.m", so processing proceeds to step 130.
[0019]
Details are given later, but when processing proceeds to step 130, the search condition creation unit 2 adds to the search status temporary storage unit processes that search using, as search words, each of the one or more words extracted in Process.1, sets the processing target to the head of the added processes, and then proceeds to step 140. In step 140, the document search unit 4 searches for documents using the search word and stores the result in the result document list item of the search status temporary storage unit. For the document set of FIG. 2(a), searching for "fuel cell" hits doc1 and doc2, so doc1 and doc2 are registered in the result document list item of Process.1, as shown in FIG. 4(b).
[0020]
Next, the appearing word calling unit 51 retrieves from the document information storage unit 3 the word information for the words contained in documents doc1 and doc2, which are registered in the result document list item of Process.1 (step 150). For the document set shown in FIG. 2, "Taro Yamada" and "Hanako Sato", whose word type is personal name, are extracted. Next, the word ranking unit 52 assigns an importance to each extracted word according to the configured weighting method and ranks the words (step 160). Ranking follows Condition 1 and Condition 2 below.
[0021]
Condition 1: If the weighting method item contains "result document list of Process.m", the importance of words appearing in documents in the result document list of Process.m is increased. When word t appears in each of documents D1 to Dn, its importance in this embodiment is calculated by
(Equation 1)

Here, doc_appear(Dk, m) takes the value 1 if document Dk is included in the result document list of Process.m, and 0 otherwise. w₀ and w₁ are constants.
Condition 2: If the weighting method item contains "default", words that appear more often are given a higher importance. When word t appears in documents D1 to Dn, this embodiment sets
weight of word t = n
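The image for Equation 1 is not reproduced in this text. One form that is consistent with the definitions above, and with the value 1.0 + 10.0 × 0 = 1.0 worked out later for Process.2-2, is shown below. The per-document term follows that worked value; the summation over the n documents containing t is an assumption.

```latex
\mathrm{importance}(t) \;=\; \sum_{k=1}^{n} \Bigl( w_0 + w_1 \cdot \mathrm{doc\_appear}(D_k, m) \Bigr),
\qquad
\mathrm{doc\_appear}(D_k, m) \in \{0, 1\}
```

Under this reading, setting w₀ = 1 and w₁ = 0 reduces the expression to n, which is the default weighting of Condition 2.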
[0022]
An importance is assigned to each word according to Condition 1 or Condition 2 above. Condition 2 applies in Process.1. The personal name "Taro Yamada" in doc1 and the personal name "Hanako Sato" in doc2 each appear once, so
Importance of the word "Taro Yamada" = 1.0
Importance of the word "Hanako Sato" = 1.0
It is then checked whether a next process exists (step 170). If one exists, the search condition creation unit 2 sets the processing target to the head of the next process and returns to step 120. If no next process exists, the input/output unit 1 outputs the search result (step 180).
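The default weighting of Condition 2 amounts to counting the result documents in which each word appears. A minimal Python sketch follows; the function name and the shape of the word data are assumptions made for illustration.

```python
from collections import Counter

def default_importance(result_docs, words_by_doc, word_type):
    """Condition 2: importance of a word = number of result documents containing it."""
    counts = Counter()
    for doc_id in result_docs:
        # Count each word at most once per document, since the weight is defined as n.
        for word in {w for w, kind in words_by_doc[doc_id] if kind == word_type}:
            counts[word] += 1
    return {word: float(n) for word, n in counts.items()}

# Assumed word information for the two documents hit by "fuel cell".
WORDS = {
    "doc1": [("Taro Yamada", "person"), ("○○ University", "organization")],
    "doc2": [("Hanako Sato", "person")],
}
process1 = default_importance(["doc1", "doc2"], WORDS, "person")
```

Here `process1` gives both personal names an importance of 1.0, as in the text.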
[0023]
The processing of Process.2 is described below with reference to FIG. 4.
In step 120, the search word item contains "ranking result of Process.1", so processing proceeds to step 130. In step 130, Process.2-1 and Process.2-2 are added to the processing order item so that searches can be performed with "Taro Yamada" and "Hanako Sato" as the respective search words. As a result, the contents of the search status temporary storage unit become as shown in FIG. 4(c). The processing target is then set to the head of the added Process.2-1, and processing proceeds to step 140.
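The check in step 120 and the expansion in step 130 can be sketched as follows. This is a hedged illustration: the dictionary keys and the `"ranking:"` prefix used to mark a reference to an earlier ranking result are assumptions, not part of the source.

```python
def is_literal_word(search_word):
    """Step 120: a search word is literal unless it refers to an earlier ranking result."""
    return not search_word.startswith("ranking:")

def expand(entry, rankings):
    """Step 130: create one sub-process per word ranked by the referenced process."""
    source = entry["search_word"].split(":", 1)[1]
    return [
        {**entry, "order": f'{entry["order"]}-{i}', "search_word": word}
        for i, word in enumerate(rankings[source], start=1)
    ]

process2 = {"order": "Process.2", "search_word": "ranking:Process.1",
            "output_type": "organization", "weighting": "result_docs:Process.1"}
rankings = {"Process.1": ["Taro Yamada", "Hanako Sato"]}
sub_processes = expand(process2, rankings)
```

The result contains Process.2-1 with the search word "Taro Yamada" and Process.2-2 with "Hanako Sato", both inheriting the output type and weighting method of Process.2.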
[0024]
In step 140, the document search unit 4 searches for documents using "Taro Yamada" as the search word. For the document set of FIG. 2, for example, searching for "Taro Yamada" hits doc1, doc3, and doc5, and these documents are registered in the result document list item of Process.2-1, as shown in FIG. 4(d). In step 150, the organization names "○○ University" and "△△ University", which appear in doc1, doc3, and doc5, are extracted.
[0025]
The weighting method item of Process.2-1 contains "result document list of Process.1", so in step 160 weighting is performed such that organization names contained in doc1, which is registered in the result document list item of Process.1, are given priority. With w₀ = 1.0 and w₁ = 10.0, a high importance is assigned to the organization name "○○ University", which is contained in a document hit in Process.1.
[0026]
Process.2-2 is handled in the same way. In this case the documents containing "Hanako Sato" are doc2 and doc4, and "×× University" is extracted from doc4 as the organization name. With w₀ = 1.0 and w₁ = 10.0, the importance of the word "×× University" is 1.0 + 10.0 × 0 = 1.0. The importance is low, but the organization can still be extracted as the affiliation.
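The weighting of Condition 1 for Process.2-1 and Process.2-2 can be sketched in Python as below. This is an illustration under stated assumptions: the summation over hit documents is an inferred form of Equation 1, whose image is not reproduced here, and the organization names assigned to each document are assumed because FIG. 2 is not available. Only the value 1.0 for "×× University" is confirmed by the text.

```python
W0, W1 = 1.0, 10.0  # constants used in the worked example

def doc_appear(doc_id, earlier_result_docs):
    """1 if the document was also hit by the earlier process, otherwise 0."""
    return 1 if doc_id in earlier_result_docs else 0

def importance(word, hit_docs, orgs_by_doc, earlier_result_docs):
    """Sum w0 + w1 * doc_appear over the hit documents that contain the word."""
    return sum(
        W0 + W1 * doc_appear(doc_id, earlier_result_docs)
        for doc_id in hit_docs
        if word in orgs_by_doc.get(doc_id, [])
    )

# Assumed organization names per document (not taken from the source).
ORGS = {"doc1": ["○○ University"], "doc3": ["△△ University"], "doc4": ["×× University"]}
process1_docs = ["doc1", "doc2"]

yamada = {org: importance(org, ["doc1", "doc3", "doc5"], ORGS, process1_docs)
          for org in ["○○ University", "△△ University"]}
sato = importance("×× University", ["doc2", "doc4"], ORGS, process1_docs)
```

With this toy data, "○○ University" scores 11.0 because doc1 was also hit in Process.1, "△△ University" scores 1.0, and "×× University" scores 1.0 as stated in the text.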
[0027]
FIG. 4(e) shows the final processing result. That is, in response to the search request "names of people related to 'fuel cell' and their affiliated organizations", the following is provided to the user:
"Taro Yamada" ○○ University
"Hanako Sato" ×× University
[0028]
As described above, according to the first embodiment, when the names and affiliated organizations of people related to a certain theme are searched for, a document that was hit by the search using the word representing the theme is given a larger weight if it is also hit by the separate search using the word representing a personal name. Thus, when the word representing the theme, the word representing the personal name, and the word representing the organization all appear together in the same document, the organization named in that document is preferentially extracted as the affiliation. In addition, organization names are also extracted from documents that do not contain the word representing the theme, so the person's affiliation can still be found. This reduces missed information and makes it possible to obtain highly accurate information.
[0029]
Second embodiment
In the first embodiment, the affiliated organization can be searched for effectively when a personal name appears with both its last name and first name. However, personal names do not always appear in documents in full. For example, only the last name may appear together with a title, as in "Professor Yamada", with the first name absent from the document. According to the second embodiment described below, when an affiliated organization is searched for from a person's name, it can also be found from a document in which the person's name is represented only by a last name. The configuration of the information search device of the second embodiment is the same as that of the first embodiment shown in FIG. 1, but differs in that, as shown in FIG. 5, information indicating the appearance position of each word is also stored in the document information storage unit 3. The appearance position is the number of characters from the beginning of the document.
[0030]
The operation of the device according to the second embodiment will be described below with reference to the flowchart of FIG. 6. Here, it is assumed that the input/output unit 1 has received, as the search request, "the names of persons related to 'fuel cell' and their affiliated organizations".
[0031]
In step 200, the same processing as in step 100 of the first embodiment is performed. However, in the second embodiment, Process.2a is set to be executed after Process.1 in place of Process.2.
In Process.2a, for each highly ranked "last name" among those extracted in Process.1, documents containing that last name are searched for, and words whose word type is "organization" are extracted from the words appearing in the retrieved documents. In Process.2a, a high importance is assigned when a word representing a "post" extracted in Process.1 appears in the vicinity of the last name (for example, at an appearance position within 10 characters). FIG. 7A shows the state of the search status temporary storage unit after these procedures have been set.
[0032]
In step 210, the same processing as step 110 of the first embodiment is performed, and the process proceeds to step 220. In step 220, as in step 120 of the first embodiment, it is determined whether or not the search word item is a word. In Process.1, the search word item is the word "fuel cell", as shown in FIG. 7. In Process.2a, the search word item is "last name in the ranking of Process.1", so the process proceeds to step 230. In step 230, the same processing as step 130 of the first embodiment is performed, and the process proceeds to step 240.
[0033]
In step 240, the same processing as in step 140 of the first embodiment is performed. For example, in the document set shown in FIG. 5, when "fuel cell" is searched for, doc1 is hit, and doc1 is registered in the result document list item of Process.1 (FIG. 7B). Subsequently, in step 250, the same processing as in step 150 of the first embodiment is performed, and the process proceeds to step 260. In step 260, the "personal name" or "last name" contained in doc1 and the "post" attached to it are extracted from doc1. The word ranking unit 52 ranks the extracted words according to the set weighting method. Ranking is performed according to the following conditions.
[0034]
Condition 1: If the content of the weighting method item is "post in the ranking of Process.m", each hit document is checked to see whether the "post" in the ranking result of Process.m appears in the vicinity of the "last name" that matches the search word (for example, at an appearance position within 10 characters). If it does, the weight of the document is increased. If a different post appears there, the weight of the document is reduced.
In the present embodiment, when a word t is extracted from the documents D1 to Dn in which the word matching the search word (last name) appears, its importance is calculated by
(Equation 2)
\[ \mathrm{importance}(t) = w_0 + \sum_{k=1}^{n} \mathrm{doc\_appear}(D_k, m) \]
Here, doc_appear(Dk, m) takes the value w₁ if the "post" of the ranking result of Process.m appears near the word that matches the search word (last name) in the document Dk, -w₂ if a different post appears there, and 0 if no post appears. A specific example will be described later.
Condition 2: A ranking process similar to the condition 2 in step 160 of the first embodiment is performed.
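The proximity check of Condition 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the tuple layout `(text, word_type, position)`, the function and parameter names, the sample positions, and the rule that the ranked post takes precedence when several posts are nearby are all hypothetical. Only the values w₁, -w₂, 0 and the 10-character window come from the description.

```python
def doc_appear(words, last_name, ranked_post, w1=10.0, w2=10.0, window=10):
    """Document weight under Condition 1 (assumed data layout).

    words: list of (text, word_type, position), where position is the
           number of characters from the beginning of the document.
    """
    name_positions = [pos for text, _, pos in words if text == last_name]
    # Posts whose appearance position is within `window` characters of the last name.
    nearby_posts = {
        text
        for text, word_type, pos in words
        if word_type == "post"
        and any(abs(pos - name_pos) <= window for name_pos in name_positions)
    }
    if ranked_post in nearby_posts:
        return w1        # the ranked post appears near the last name
    if nearby_posts:
        return -w2       # only a different post appears near the last name
    return 0.0           # no post near the last name


# Hypothetical word positions for two documents hit by "Yamada".
doc2 = [("XX University", "organization", 0), ("Yamada", "last name", 20),
        ("Professor", "post", 26)]
doc3 = [("△△ University", "organization", 0), ("Yamada", "last name", 20),
        ("Assistant Professor", "post", 26)]

w0 = 1.0
print(w0 + doc_appear(doc2, "Yamada", "Professor"))  # 11.0
print(w0 + doc_appear(doc3, "Yamada", "Professor"))  # -9.0
```

Because a mismatching post yields a negative weight, a document about a different person with the same last name lowers the importance of the organization extracted from it.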
[0035]
After the ranking processing is completed, the process proceeds to step 270, where the same processing as step 170 of the first embodiment is performed, and subsequently, at step 280, the same processing as step 180 of the first embodiment is performed.
[0036]
When the processing of Process.1 up to step 260 is completed, the head of Process.2a is set as the processing target, and the process returns to step 210. The processing of Process.2a will be described below with reference to FIG. 7.
In step 220, it is determined that the search word item of Process.2a is the ranking result of Process.1, and the process proceeds to step 230. In step 230, a process that uses the last name "Yamada" of "Taro Yamada" as the search word is added to the search status temporary storage unit. As a result, the state of the search status temporary storage unit becomes as shown in FIG. 7. After the head of the added Process.2a-1 is set as the processing target, the process proceeds to step 240.
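The expansion performed in step 230 can be pictured as turning a process entry whose search word refers to an earlier ranking into concrete entries, one per ranked word. The Python sketch below is a hypothetical model of the search status temporary storage unit. The class `Process`, its field names, and the function `expand_process` are assumptions for illustration. Only the names Process.2a and Process.2a-1 and the search word "Yamada" come from the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Process:
    name: str                 # e.g. "Process.2a"
    search_word: str          # a concrete word, or a reference to an earlier ranking
    weighting: str            # content of the weighting method item
    result_docs: List[str] = field(default_factory=list)  # result document list


def expand_process(template: Process, ranked_words: List[str]) -> List[Process]:
    """Create one concrete process per ranked word (Process.2a-1, Process.2a-2, ...)."""
    return [
        Process(name=f"{template.name}-{i}", search_word=word,
                weighting=template.weighting)
        for i, word in enumerate(ranked_words, start=1)
    ]


template = Process("Process.2a", "last name in the ranking of Process.1",
                   "post in the ranking of Process.1")
added = expand_process(template, ["Yamada"])
print(added[0].name, added[0].search_word)  # Process.2a-1 Yamada
```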
[0037]
In step 240, the document search unit 4 performs a search using "Yamada" as the search word. For example, in the case of the document set of FIG. 5, doc1, doc2, and doc3 are hit, so doc1, doc2, and doc3 are registered in the result document list item of Process.2a-1, as shown in FIG. 7. In step 250, based on the contents of the document information storage unit shown in FIG. 5, the organization names "XX University" and "△△ University" included in doc2 and doc3 are extracted.
[0038]
Since the weighting method item of Process.2a-1 is "post in the ranking of Process.1", a document in which the post "Professor" described in doc1 appears in the vicinity of "Yamada" is given priority and assigned a large weight.
For example, with w₀ = 1.0, w₁ = 10.0, and w₂ = 10.0:
Importance of the word "XX University" = w₀ + doc_appear(doc2, 1) = 1.0 + 10.0 = 11.0
Importance of the word "△△ University" = w₀ + doc_appear(doc3, 1) = 1.0 + (-10.0) = -9.0
Thus, the organization name "XX University", included in doc2 where "Yamada" and "Professor" appear close to each other, is given a high importance, while the organization name included in doc3, where "Yamada" and "Assistant Professor" appear close to each other, is given a low importance.
[0039]
FIG. 7D shows the final processing result. As a result, the name of the person related to "fuel cell" and the organization to which the person belongs are "Taro Yamada" (XX University).
As described above, according to the second embodiment, even when the "last name" and the "post" appear together, as in "Professor Yamada", and the "first name" is not written, the affiliated organization can be searched for from the person's name by searching with the last name only and checking whether the "post" (or "first name") extracted in the preceding search appears near the last name. This makes it possible to extract affiliated organizations with high accuracy.
[0040]
Third embodiment
In the first and second embodiments, the type of the search target document is not particularly limited. In the third embodiment, however, the search target is limited to documents that are published on web sites and whose locations can be specified by a URL (Uniform Resource Locator), and the URL is used when weighting documents.
[0041]
For example, when the URL of doc1 in FIG. 2A starts with "http://www.aaa.ac.jp" and the URL of doc3 starts with "http://www.bbb.ac.jp", the "Taro Yamada" appearing in these two documents is likely to refer to different persons. This is because the locations of the two documents on the web are significantly different.
Conversely, if the URLs of doc2 and doc4 match up to "http://www.aaa.ac.jp/lab/xxxx", the "Hanako Sato" appearing in the two documents is highly likely to be the same person. Therefore, in the third embodiment, weighting is performed using the positional relationship of the URLs of the retrieved documents.
[0042]
FIG. 8 shows the configuration of the information search device of the third embodiment. As shown in the figure, the third embodiment differs from the first embodiment in that a document position similarity determination unit 6 is added.
The document position similarity determination unit 6 has a function of receiving two URLs as a comparison input and outputting information indicating a positional relationship between the two URLs as a similarity. Further, the document information storage unit 3 also stores URL information of each document. Here, it is assumed that each document shown in FIG. 2 has the following URL.
[0043]
doc1: http://www.aaa.ac.jp/xx/yy/personl.html
doc2: http://www.aaa.ac.jp/lab/xxxxxx/al.html
doc3: http://www.bbb.ac.jp/zz/www/index.html
doc4: http://www.aaa.ac.jp/lab/xxxx/bl.htm
doc5: http://www.aaa.ac.jp/zz/www/pp/index.html
[0044]
Next, the operation of the third embodiment will be described with reference to FIG. 9. The operation of the third embodiment is the same as that of the first embodiment in steps 100 to 150, 170, and 180, and differs only in that step 300 described below is executed instead of step 160.
In step 300, the word ranking unit 52 ranks words by weighting documents according to the following conditions.
[0045]
Condition 2 is the same as Condition 2 in step 160 of the first embodiment, so its description is omitted and only Condition 1 is described.
Condition 1: If the content of the weighting method item is "Process.m result document list", the URL of each document in which the word t extracted by the second search appears is compared with the URL of each document in the result document list of Process.m, and importance is assigned to the word t based on the comparison result. In this embodiment, the importance of the word t is calculated according to the following equation.
[Equation 3]

[0046]
The value of the function url_distance is calculated by the document position similarity determination unit 6 according to the following procedure, for example.
When the domain names are different, as with http://www.aaa.ac.jp and http://www.bbb.ac.jp (the former is aaa.ac.jp and the latter is bbb.ac.jp), the value is 0.
When the subdomain names are different, as with http://www.xx.aaa.ac.jp and http://www.yy.aaa.ac.jp (the former is xx.aaa.ac.jp and the latter is yy.aaa.ac.jp), the value is 0.1.
When only the domain names match, as with http://www.yy.aaa.ac.jp/info/1.ht and http://www.yy.aaa.ac.jp/org/2.html, the value is 0.2.
[0047]
Furthermore, scoring is performed as follows according to the matching status of directories.
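0.5 (1/2) if the URLs match up to "http://www.aaa.ac.jp/level-1/"
0.75 (1/2 + 1/4) if they match up to "http://www.aaa.ac.jp/level-1/level-2"
0.875 (1/2 + 1/4 + 1/8) if they match up to "http://www.aaa.ac.jp/level-1/level-2/level-3"
…
In general, when the URLs match up to level n, the value is
(Equation 4)
\[ \mathrm{url\_distance} = \sum_{i=1}^{n} \frac{1}{2^{i}} \]
For example, doc1 and doc5 match only up to http://www.aaa.ac.jp, so the value 0.2 is obtained.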
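The scoring procedure above can be sketched in Python with the standard library's `urllib.parse.urlsplit`. This is a minimal illustration, not the method of the document position similarity determination unit 6 itself. It assumes that the domain name is the last three labels of the host (which fits "aaa.ac.jp" but not every domain), that the last path segment is a file name rather than a directory, and the parameter name `domain_labels` is hypothetical. Note that, despite the name, a larger value means the two documents are closer.

```python
from urllib.parse import urlsplit


def url_distance(url_a, url_b, domain_labels=3):
    """Positional similarity of two URLs, following the scoring steps above."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    host_a, host_b = a.hostname or "", b.hostname or ""

    # Assumption: the domain name is the last `domain_labels` labels of the host.
    if host_a.split(".")[-domain_labels:] != host_b.split(".")[-domain_labels:]:
        return 0.0   # different domain names
    if host_a != host_b:
        return 0.1   # same domain, different subdomain names

    # Directory levels, dropping the final segment (assumed to be a file name).
    dirs_a = [seg for seg in a.path.split("/")[:-1] if seg]
    dirs_b = [seg for seg in b.path.split("/")[:-1] if seg]

    matched = 0
    for seg_a, seg_b in zip(dirs_a, dirs_b):
        if seg_a != seg_b:
            break
        matched += 1

    if matched == 0:
        return 0.2   # only the host matches
    # 1/2 + 1/4 + ... + 1/2**matched, written in closed form.
    return 1.0 - 0.5 ** matched


doc1 = "http://www.aaa.ac.jp/xx/yy/personl.html"
doc3 = "http://www.bbb.ac.jp/zz/www/index.html"
doc5 = "http://www.aaa.ac.jp/zz/www/pp/index.html"
print(url_distance(doc1, doc3))  # 0.0
print(url_distance(doc1, doc5))  # 0.2
```

The closed form 1 - (1/2)^n keeps the score below 1 however deep the match is, so each additional matching directory level adds half as much as the previous one.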
[0048]
The calculation results for the organization to which "Taro Yamada" belongs are described below. The extracted organization names are "XX University" and "△△ University". Here, w₀ = 1.0 and w₁ = 10.0.

[0049]
Next, the calculation results for the organization to which "Hanako Sato" belongs will be described. The extracted organization name is "×× University". Similarly, w₀ = 1.0, w₁ = 1 · 0

Although the importance was low in the first embodiment, it is high in the third embodiment because the URLs of doc2 and doc4 are similar.
[0050]
In the third embodiment, the positional relationship between URLs is used when ranking is performed: when documents are located at the same site, a higher weight is set for documents whose locations are closer, so weighting can be performed with high accuracy.
[0051]
The first to third embodiments described above relate to the case where the search request from the user is the extraction of the names of persons related to a certain theme and their affiliated organizations, but the present invention is not limited to this. For example, by making the output information of Process.1 the organization name and the output information of Process.2 the person's name, organization names related to a certain theme and the names of persons belonging to those organizations can also be extracted.
The document search unit 4 is not limited to one that performs a keyword search. The search input from the user may be a document instead of a word such as “fuel cell”, and the document search unit 4 may perform a similar document search.
The method of calculating the URL distance described in the third embodiment is an example, and any method of calculating the distance between documents based on the URL comparison result can be used.
[0052]
The weight calculation method in the first, second, and third embodiments can be combined with the weight calculation method described in JP-A-11-25108.
In the second embodiment, the "post" is used in the weighting of Process.2a, but the "first name" may be used instead. Further, the first embodiment and the second embodiment may be combined.
In each of the embodiments, the search is performed continuously while changing the search word. However, only the result of the first personal-name extraction may be output to the user, and the affiliated organization may then be searched for only for the personal names designated by the user.
[0053]
[Effect of the invention]
According to the present invention, in an information search device that provides the information requested by a user by searching a specific document set for documents including a specific keyword and further searching the set of retrieved documents for documents including other keywords related to the specific keyword, the possibility of providing erroneous information and the possibility of missing correct information are greatly reduced as compared with the related art.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an information search device according to Embodiment 1 of the present invention.
FIG. 2 is a diagram illustrating a structure of a document information storage unit of the information search device according to the first embodiment.
FIG. 3 is a flowchart illustrating an operation of the information search device according to the first embodiment.
FIG. 4 is a diagram illustrating a change in the state of a search status temporary storage unit of the information search device according to the first embodiment.
FIG. 5 is a diagram illustrating a structure of a document information storage unit of the information search device according to the second embodiment.
FIG. 6 is a flowchart illustrating an operation of the information search device according to the second embodiment of the present invention.
FIG. 7 is a diagram illustrating a change in the state of a search status temporary storage unit of the information search device according to the second embodiment.
FIG. 8 is a block diagram showing a configuration of an information search device according to Embodiment 3 of the present invention.
FIG. 9 is a flowchart illustrating an operation of the information search device according to the third embodiment.
[Explanation of symbols]
1 Input / output unit, 2 Search condition creation unit, 3 Document information storage unit, 4 Document search unit, 5 Automatic keyword extraction unit, 6 Document position similarity determination unit, 51 Appeared word calling unit, 52 Word ranking unit.

Claims

An information retrieval apparatus having means for searching a set of search target documents for a first document including a specific search word, extracting from the first document a first word that meets a predetermined extraction condition, searching the set of search target documents for a second document including the extracted first word, and extracting from the second document a second word that meets a predetermined extraction condition, the information retrieval apparatus comprising:
weighting means for assigning a weight to the second document according to a search result of the first document; and
first ranking means for assigning importance to the second word according to the weight of the second document and ranking the second word.

The information retrieval apparatus according to claim 1, wherein, when there are a plurality of the second documents, the first ranking means assigns importance to the second word according to the sum of the weights of the plurality of second documents.

The information retrieval apparatus according to claim 1, further comprising second ranking means for assigning importance to the first word based on statistical information including at least the frequency of appearance of the first word, wherein, when a plurality of first words are extracted, the second word is extracted for each of the first words having higher importance.

The information retrieval apparatus according to claim 1, wherein the weighting means assigns a large weight to the second document when the second document is the same as the first document.

The information retrieval apparatus according to claim 1, wherein the search target document set is a set of documents published on web sites, and the weighting means compares the URL of the second document with the URL of the first document and assigns a weight to the second document based on the comparison result.

The information retrieval apparatus according to claim 1, wherein the weighting means assigns a large weight to the second document when at least a part of the first word is included in the second document.

The information retrieval apparatus according to claim 3, wherein, in a case where the first word is a word representing a person's name and the second word is a word representing an affiliated organization, the first ranking means gives a large weight to the second document when the same word as a highly ranked word representing a post included in the first document appears near the first word in the second document.

The information retrieval apparatus according to claim 3, wherein, in a case where the first word is a word representing a person's name and the second word is a word representing an affiliated organization, the first ranking means gives a small weight to the second document when a word representing a post that appears near the first word in the second document differs from the highly ranked word representing a post included in the first document.

The information retrieval apparatus according to claim 3, wherein, in a case where the first word is a word representing a person's name and the second word is a word representing an affiliated organization, the first ranking means gives a large weight to the second document when the part of the first word representing the first name appears in the second document near the highly ranked word representing the last name.