JP2004086307A

JP2004086307A - Information retrieving device, information registering device, information retrieving method, and computer readable program

Info

Publication number: JP2004086307A
Application number: JP2002243537A
Authority: JP
Inventors: Koji Maekawa; 前川　浩司
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2002-08-23
Filing date: 2002-08-23
Publication date: 2004-03-18

Abstract

PROBLEM TO BE SOLVED: To eliminate search omissions when searching with a natural sentence, greatly improve operability, keep search processing to a minimum, and increase search speed.
SOLUTION: At registration, a CPU 2 of the information processing device, running a processing program 4-1, assigns a unique document ID to a document to be registered, extracts words from the document as keywords, refers to word group data that groups, in predetermined units, multiple words sharing a specific meaning (variant notations, variant characters, synonyms, near-synonyms, and the like), assigns a standard notation to each extracted keyword, and creates search data in which the word group information serves as the index and the document ID is registered. At search time, the CPU 2 extracts words from the search condition as keywords, refers to the word group data, assigns a standard notation to each extracted keyword, searches the search data based on the standard notation, determines whether each result matches the search condition, and creates the final search result.
COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、コンピュータ等において文書等の情報を検索する場合に適用される情報検索装置、情報登録装置、情報検索方法、及びコンピュータ読み取り可能なプログラムに関する。
【０００２】
【従来の技術】
従来、検索条件を入力し、キーワードに基づき検索対象の文書を検索し、検索結果を得る文書検索方法が各種提案されている。
【０００３】
図２４は従来技術１の文書検索処理を説明するための図である。検索者は検索条件を自然文によって指定する。検索クエリとして「アメリカの大統領」が入力される。入力された検索文字列に対して、形態素解析などの単語切り出し手段によってキーワード抽出を行ない、「アメリカ」及び「大統領」というキーワードが抽出される。抽出したキーワードを基に検索条件を設定する。ここでは、［アメリカ］ＡＮＤ［大統領］という検索条件によって検索を行なう。予め登録されている検索対象文書に対して、上記の検索条件によって検索を行なう。検索の結果、文書ＩＤ＝１「アメリカ大統領が来日」と文書ＩＤ＝２「アメリカ大統領選挙」を検索結果として得ることができる。尚、図中、検索結果文書の下に記載しているものは検索結果除外文書である。
【０００４】
図２５は従来技術２の文書検索処理を説明するための図である。検索者は検索条件をキーワードの論理式等によって指定する。検索式として以下の検索式を指定する。
［［アメリカ］ＯＲ［米国］ＯＲ［合衆国］］ＡＮＤ［［大統領］］
この時、「アメリカ」や「大統領」といった単語表記を含む文書を検索したい場合、検索者は検索漏れが起きないように論理式を指定しなければいけない。予め登録されている検索対象文書に対して、上記の検索条件によって検索を行なう。検索の結果、文書ＩＤ＝１「アメリカの大統領が来日」、文書ＩＤ＝２「アメリカ大統領選挙」、文書ＩＤ＝３「米国で大統領選挙」、文書ＩＤ＝４「米国ではプレジデントが最高の」をそれぞれ検索結果として得ることができる。尚、図中、検索結果文書の下に記載しているものは検索結果除外文書である。
【０００５】
図２６は従来技術３の文書検索処理を説明するための図である。検索者は検索条件を自然文によって指定する。検索クエリとして「アメリカの大統領」が入力される。入力された検索文字列に対して、形態素解析などの単語切り出し手段によってキーワード抽出を行ない、「アメリカ」及び「大統領」というキーワードが抽出される。抽出されたキーワードを基に、検索者の指示或いは自動的に類義語や同義語などの情報を用いてキーワード拡張する。拡張したキーワードを基に検索条件を設定する。ここでは、以下の検索条件に従って検索を実行する。
［［アメリカ］ＯＲ［米国］ＯＲ［合衆国］ＯＲ［ＵＳＡ］］ＡＮＤ［［大統領］ＯＲ［プレジデント］］
予め登録されている検索対象文書に対して、上記の検索条件によって検索を行なう。検索の結果、文書ＩＤ＝１「アメリカの大統領が来日」、文書ＩＤ＝２「アメリカ大統領選挙」、文書ＩＤ＝３「米国で大統領選挙」、文書ＩＤ＝４「米国ではプレジデントが最高の」、文書ＩＤ＝５「初代合衆国大統領」をそれぞれ検索結果として得ることができる。
【０００６】
情報検索に関する技術としては、例えば特開平７−６５０１３号公報や特開平８−２５５１６３号公報などが、上記の従来技術３に含まれる技術である。
【０００７】
【発明が解決しようとする課題】
しかしながら、上記従来技術においては次のような問題があった。
【０００８】
従来技術１では、検索クエリ「アメリカの大統領」に対して「米国」や「プレジデント」で表記された検索結果を得ることができなかった（図２４の検索結果除外文書参照）。そのため、検索漏れが多く、すべての検索結果を取得したい場合や、同義語や表記のゆれが多い単語を検索する場合、満足する検索結果を得ることはできなかった。
【０００９】
従来技術２では、検索者自身が検索式を考えて設定しなければならないために、網羅的に検索したい場合は、複雑な検索式を自ら設定するために、検索者の負担が増大し操作性において非常に使い勝手が悪かった。
【００１０】
従来技術３では、同義語や類義語などの情報を基にキーワードを拡張し、検索を行なうために、漏れの少ない検索結果が得ることができたが、検索キーワードが増加するために検索速度が遅くなるという欠点があった。
【００１１】
本発明は、上述した点に鑑みなされたものであり、自然文による検索を行なうときの検索漏れを解消し、操作性を大幅に向上し、検索処理を最小限に抑え、検索速度の高速化を実現可能とした情報検索装置、情報登録装置、情報検索方法、及びコンピュータ読み取り可能なプログラムを提供することを第１の目的とする。
【００１２】
また、本発明は、検索用データや内部処理に要するメモリを節約し、検索用データの検索処理や、検索用データへの追加登録処理を高速に実行可能とした情報検索装置、情報登録装置、情報検索方法、及びコンピュータ読み取り可能なプログラムを提供することを第２の目的とする。
【００１３】
また、本発明は、検索結果が多数存在した場合においても、検索者の探したい情報をより早く取得し、操作性の向上を可能とした情報検索装置、情報登録装置、情報検索方法、及びコンピュータ読み取り可能なプログラムを提供することを第３の目的とする。
【００１４】
【課題を解決するための手段】
上記目的を達成するため、本発明は、情報を検索する情報検索装置であって、情報の登録時に、登録対象の情報からキーワードとなる情報を抽出し、前記抽出したキーワードとなる情報を類似したもの同士でまとめる正規化を行ない、前記正規化を行なった情報を見出しとした検索用データを作成する登録処理手段と、情報の検索時に、検索条件からキーワードとなる情報を抽出し、前記抽出したキーワードとなる情報を類似したもの同士でまとめる正規化を行ない、前記正規化を行なった情報を基に前記検索用データを検索し、前記検索した内容を前記検索条件と比較することで妥当性を判定し、検索結果を作成する検索処理手段とを有することを特徴とする。
【００１５】
また、本発明は、情報を検索する情報検索方法であって、情報の登録時に、登録対象の情報からキーワードとなる情報を抽出し、前記抽出したキーワードとなる情報を類似したもの同士でまとめる正規化を行ない、前記正規化を行なった情報を見出しとした検索用データを作成し、情報の検索時に、検索条件からキーワードとなる情報を抽出し、前記抽出したキーワードとなる情報を類似したもの同士でまとめる正規化を行ない、前記正規化を行なった情報を基に前記検索用データを検索し、前記検索した内容を前記検索条件と比較することで妥当性を判定し、検索結果を作成することを特徴とする。
【００１６】
また、本発明は、コンピュータに、情報の登録時に、登録対象の情報からキーワードとなる情報を抽出し、前記抽出したキーワードとなる情報を類似したもの同士でまとめる正規化を行ない、前記正規化を行なった情報を見出しとした検索用データを作成する機能と、情報の検索時に、検索条件からキーワードとなる情報を抽出し、前記抽出したキーワードとなる情報を類似したもの同士でまとめる正規化を行ない、前記正規化を行なった情報を基に前記検索用データを検索し、前記検索した内容を前記検索条件と比較することで妥当性を判定し、検索結果を作成する機能とを実行させるコンピュータ読み取り可能なプログラムであることを特徴とする。
【００１７】
また、本発明は、情報登録装置において、登録対象の情報からキーワードとなる情報を抽出し、前記抽出したキーワードとなる情報を類似したもの同士でまとめる正規化を行ない、前記正規化を行なった情報を見出しとした検索用データを作成する登録処理手段を有することを特徴とする。
【００１８】
また、本発明は、情報検索装置において、検索条件からキーワードとなる情報を抽出し、前記抽出したキーワードとなる情報を類似したもの同士でまとめる正規化を行ない、前記正規化を行なった情報を基に前記検索用データを検索し、前記検索した内容を前記検索条件と比較することで妥当性を判定し、検索結果を作成する検索処理手段を有することを特徴とする。
【００１９】
【発明の実施の形態】
以下、本発明の実施の形態を図面に基づいて詳細に説明する。
【００２０】
［第１の実施の形態］
図１は本発明の第１の実施の形態に係る情報処理装置（情報検索装置）の構成を示すブロック図である。情報処理装置は、文書や画像などのマルチメディア情報を管理すると共に、管理しているマルチメディア情報の中から検索者が検索条件を設定し必要とするファイルを取得可能とするものであり、入力装置１、ＣＰＵ２、出力装置３、記憶装置４を備えている。
【００２１】
入力装置１は、キーボードなどから構成されており、検索条件（クエリ）などの入力に用いる。ＣＰＵ２は、情報処理装置各部を制御する中央演算処理装置であり、記憶装置４上に展開された処理プログラム４−１により後述の図４（検索文書登録処理）、図５（文書検索処理）、図６（標準表記取得処理）の各フローチャートに示す処理を実行させる。出力装置３は、ディスプレイなどから構成されており、検索結果を表示出力する。記憶装置４は、メモリやハードディスクなどから構成されており、処理プログラム４−１、標準単語データ４−２、検索用データ４−３、後述の文書ＩＤ管理情報を記憶する。
【００２２】
キーボードなどの入力装置１から入力された検索条件（クエリ）は、メモリやハードディスクなどの記憶装置４上に展開された処理プログラム４−１により、ＣＰＵ２で処理される。処理プログラム４−１は、同じく記憶装置４に記憶されている標準単語データ４−２を参照して抽出キーワードを標準単語に変更し、標準単語を見出しとし文書の情報や文書に含まれる単語の特徴を格納した検索用データ４−２から、入力装置１から入力された検索条件に対する文書の類似性を計算して取得し、その検索結果をディスプレイなどの出力装置３に出力する。
【００２３】
また、本発明の情報検索に関わる動作環境としては、上記図１に示した単体の情報処理装置（コンピュータ）に対応することができる以外に、図２に示すようなローカルなネットワーク環境及び図３に示すようなインターネット環境にも対応することができる。図２のローカルなネットワーク環境は、サーバ２１と複数のクライアント２２、２３、２４、２５をＬＡＮ２６を介して接続した例であり、各クライアント２２〜２５がサーバ２１を利用することで情報検索を行なう。また、図３のインターネット環境は、端末３１をインターネット３２に接続した例であり、端末３１がインターネット３２に接続することで情報検索を行なう。
【００２４】
次に、上記の如く構成された第１の実施の形態に係る情報処理装置における情報検索に関わる動作を図４乃至図１９を参照しながら詳細に説明する。図４乃至図６は第１の実施の形態に係る各処理の流れを示すフローチャートである。尚、図４乃至図６の各フローチャートに示す処理は、情報処理装置のＣＰＵ２が記憶装置４上に展開された処理プログラム４−１により実行する。
【００２５】
＜検索文書登録処理＞
先ず、第１の実施の形態に係る検索文書登録処理を図４を用いて説明する。図４は検索文書登録処理を示すフローチャートである。ステップＳ１０１では、検索者は登録したい文書を情報処理装置の入力装置１から指定する。登録したい文書を指定する場合、１文書でも複数文書でも指定することが可能である。ステップＳ１０２では、情報処理装置のＣＰＵ２は、上記ステップＳ１０１で指定された文書に対して固有の文書ＩＤを付与する。図７は文書ＩＤ付与処理の例を示す図である。情報処理装置のＣＰＵ２は、文書ＩＤの使用の有無を管理している文書ＩＤ管理情報を参照して、上記ステップＳ１０１で指定された文書Ａに対し、まだ割り当てられていない文書ＩＤを付与する。
【００２６】
第１の実施の形態の例では、図７の文書ＩＤ管理情報において文書ＩＤ＝１〜１０は既に割り当て済みであるが、文書ＩＤ＝１１がまだ割り当てられていないため、情報処理装置のＣＰＵ２は、文書Ａに１１というＩＤを付与する。情報処理装置のＣＰＵ２は、同時に、文書ＩＤ管理情報にもＩＤ（１１）が使用済みである旨を記憶することで文書ＩＤ管理情報を更新する。これによって、情報処理装置（システム）内では文書Ａは文書ＩＤ（１１）として扱うことができる。
【００２７】
ステップＳ１０３では、情報処理装置のＣＰＵ２は、形態素解析処理などによって、文書内に登録するキーワードを抽出する。
【００２８】
図８は文書解析処理の例を示す図である。図８では、文書Ａの内容の「米国では大統領選挙の開票を行なった。」という解析文からキーワード抽出を行なう処理について説明する。上記解析文に対して形態素解析を実行し形態素の単位に分割することができる。即ち、図８の単語切り出し結果で表される形態素の単位に分割することができる。情報処理装置のＣＰＵ２は、この形態素の中から、キーワードとなりうる単語を抽出する。キーワードを抽出する方法は、自立語をキーワードとして抽出するとか、品詞によって判定するなど種々の抽出方法があるが、第１の実施の形態では自立語をキーワードとして抽出するものとする。その結果、図８の抽出キーワードにある形態素がキーワードとして抽出される。
【００２９】
ステップＳ１０４では、情報処理装置のＣＰＵ２は、抽出した検索キーワードの標準表記（単語群に含まれる単語の中から代表する単語の表記）を取得して標準単語として登録する。標準表記取得処理の詳細は図６のキーワードの標準化処理を用いて後述する。
【００３０】
図９は標準表記データを一覧表とした例を示す図である。多くの単語の場合、活用形や異表記（ひらがな、カタカナ、漢字の違い）、表記のゆれ（送り仮名の違い）、同義語、類義語など、同じ意味を表すために複数の語が存在する。標準表記データは、これらの語を一定の基準で一つの標準的な表記とするためのデータである。例えば図９によると、「アメリカ」、「米国」、「ＵＳＡ」、「合衆国」、「亜米利加」、「ＡＭＥＲＩＣＡ」等は全て「アメリカ」という標準表記で表現することができる。また、動詞、形容詞などは、未然、連用、終止、連体、仮定、命令等の活用があるが、それらはすべて終止形として表現する。図９の例では「行なう」と「おこなう」は異表記、「アメリカ」と「亜米利加」は異体字、「大統領」と「プレジデント」は同義語であり、図９では不図示であるが「来日」と「日本訪問」は類義語である。
【００３１】
図６は標準表記取得処理を示すフローチャートである。ステップＳ３０１では、検索者は上記図４のステップＳ１０３で抽出したキーワードを情報処理装置の入力装置１から入力する。従って、第１の実施の形態の文では「米国」「大統領」「選挙」「開票」「行っ」が対象のキーワードとなる。ステップＳ３０２では、情報処理装置のＣＰＵ２は、それぞれの基となるキーワードで標準表記データを検索する。ステップＳ３０３では、情報処理装置のＣＰＵ２は、記憶装置４に格納されている標準表記データを取得する。検索結果がない（標準表記がない）単語については、単語そのものを標準表記として、標準表記データを取得する。
【００３２】
図１０は上記図９の標準表記データを参照して入力表記から標準表記を取得する例を示す図である。その結果、標準表記として「アメリカ」「大統領」「選挙」「開票」「行なう」を取得することができる。
【００３３】
上記図４に戻り、ステップＳ１０５では、情報処理装置のＣＰＵ２は、それぞれの標準表記を見出しとする検索用データを作成する。
【００３４】
図１１は第１の実施の形態による登録前の検索用データを示す図である。第１の実施の形態での検索用データには、標準表記を見出しとして、出現した文書ＩＤを格納している。例えば、この検索用データによって、単語「アメリカ」は、１，２，３，４，５，８，１０の文書に出現することを表現している。見出し語として存在する場合はデータ内に文書ＩＤを追加し、見出し語が存在しない場合は見出し語と文書ＩＤで新たにデータを作成する。
【００３５】
図１２は検索用データへの登録例を示す図である。図１２では検索用データを参照して「アメリカ」は既に見出しとして存在するので、「アメリカ」の見出しに対応する文書ＩＤ情報の中に今回登録する文書ＩＤ＝１１の情報を追加する。一方、「開票」は見出しとして存在しないために、新たな見出し「開票」で今回登録する文書ＩＤ＝１１で新規登録する。
【００３６】
図１３は第１の実施の形態による登録後の検索用データを示す図である。図１３では既存の見出しデータには今回追加した文書のＩＤ（１１）が格納されている。また、図１３では上記図１１で存在しなかった「開票」については新しく見出しに新規登録されている。
【００３７】
＜文書検索処理＞
次に、第１の実施の形態に係る文書検索処理を図５を用いて説明する。図５は文書検索処理を示すフローチャートである。ステップＳ２０１では、検索者は検索条件を自然文もしくはキーワード論理式のいずれかで情報処理装置の入力装置１から入力する。図１４は第１の実施の形態での検索条件入力の例である。「アメリカの大統領」という自然文による検索条件入力と、［アメリカ］ＡＮＤ［大統領］という論理式による検索条件入力のどちらの検索条件入力でも検索することが可能である。検索条件を自然文で入力した場合には、ステップＳ２０２が実行される。
【００３８】
ステップＳ２０２では、情報処理装置のＣＰＵ２は、上記ステップＳ２０１で入力された自然文の検索条件から検索キーワードの抽出を行なう。ここでは、図１５の検索キーワード抽出を説明する。図１５は検索条件で入力された「アメリカの大統領」という文から検索キーワードを抽出する処理を示す図である。上記検索条件文に対して形態素解析を実行し形態素単位に分割することができる。即ち、図１５の単語切り出し結果で表される形態素の単位に分割することができる。情報処理装置のＣＰＵ２は、この形態素の中から、キーワードとなりうる単語を抽出する。キーワードを抽出する方法は、上記文書登録処理と同じ抽出方法を用いる必要があるために、文書登録処理と同様に自立語をキーワードとして抽出するものとする。その結果、図１５の抽出キーワードにある「アメリカ」と「大統領」の二つの形態素がキーワードとして抽出される。
【００３９】
ステップＳ２０３では、情報処理装置のＣＰＵ２は、抽出したキーワードに対する標準表記を取得して検索キーワードとする。標準表記取得処理の詳細は上記図６のキーワードの標準化処理を用いて説明する。
【００４０】
上記図６のステップＳ３０１では、検索者は上記図４のステップＳ１０３で抽出したキーワードを情報処理装置の入力装置１から入力する。従って、第１の実施の形態の文では「アメリカ」「大統領」が対象のキーワードとなる。ステップＳ３０２では、情報処理装置のＣＰＵ２は、上記図９の標準表記データを参照して入力されたキーワードの標準表記データを検索する。ステップＳ３０３では、情報処理装置のＣＰＵ２は、記憶装置４に格納されている標準表記データを取得する。検索結果がない（標準表記がない）単語については、単語そのものを標準表記として、標準表記データを取得する。その結果、それぞれのキーワードに対する標準表記として、図１６に示すように「アメリカ」「大統領」が標準表記データとして取得される。
【００４１】
次に、上記図５に戻り、ステップＳ２０４では、情報処理装置のＣＰＵ２は、標準表記取得処理で得られた標準表記を基に検索条件を作成して検索用データを検索する。検索用データは上記図１３に示したように見出し領域とデータ領域から構成されている。即ち、見出し情報として標準表記を使用して、データ領域には見出し情報が出現する文書ＩＤの情報が格納されている。図１７は「アメリカ」と「大統領」をそれぞれ検索し、文書ＩＤ情報を取得している例を示す図である。標準表記「アメリカ」を表す単語は、検索文書ＩＤの「１，２，３，４，５，８，１０，１１」にそれぞれ格納されており、同様に「大統領」を表す単語は、検索文書ＩＤの「１，２，３，４，５，６，９，１１」に格納されている、と検索される。
【００４２】
ステップＳ２０５では、情報処理装置のＣＰＵ２は、上記検索用データの検索で得られた結果に対して、検索結果となりうるかどうかの判定を行なう。入力した検索条件「アメリカの大統領」は、最終的には図１８に示すような［アメリカ］ＡＮＤ［大統領］として処理される。従って、｛１，２，３，４，５，８，１０，１１｝ＡＮＤ｛１，２，３，４，５，６，９，１１｝となり、この検索条件を満たす文書ＩＤは、図１８の検索結果に示すように、１，２，３，４，５，１１と判断される。この判断処理では、第１の実施の形態で記述した以外にも、単語の品詞や単語間の関係など、判定基準に用いたい情報を予めデータに登録しておくことによって、自在に組み合わせて使うことが可能である。
【００４３】
例えば、登録時には文書ＩＤと共に、キーワードとなったそれぞれの単語の品詞を登録した検索用データを作成しておき、検索時には文書ＩＤと共に、入力された単語品詞の一致・不一致を参照し、検索結果の可否判定を行なうようにしてもよい。
【００４４】
また、登録時には文書ＩＤと共に、キーワードとなったそれぞれの単語に付随する付属語情報や接尾・接頭辞情報等の情報を登録した検索用データを作成しておき、検索時にはキーワードに付随する前記情報の一致・不一致又は類似性を検索結果の妥当性の判定に加えるようにしてもよい。
【００４５】
また、登録時には文書ＩＤと共に、キーワードとなったそれぞれの単語の出現位置や、タイトル・見出し・箇条書き等の出現状態を登録した検索用データを作成しておき、検索時にはキーワードの位置関係や距離、出現状態を検索結果の妥当性の判定に加えるようにしてもよい。
【００４６】
また、登録時には文書ＩＤと共に、文構造を抽出し単語の係り先や受け先等の情報を登録した検索用データを作成しておき、検索時には検索条件の係り受け関係と検索用データに登録された単語の係り受け関係を検索結果の妥当性の判定に加えるようにしてもよい。
【００４７】
また、登録時には上記情報（キーワードとなったそれぞれの単語の品詞を示す情報、キーワードとなったそれぞれの単語に付随する付属語情報や接尾・接頭辞情報等の情報、キーワードとなったそれぞれの単語の出現位置やタイトル・見出し・箇条書き等の出現状態を示す情報、文構造を抽出し単語の係り先や受け先等の情報）の任意の組み合わせ又は全部を検索用データとして登録し、検索時には検索用データに含まれる全ての情報を参照し、検索結果の妥当性の判定に加えるようにしてもよい。
【００４８】
ステップＳ２０６では、情報処理装置のＣＰＵ２は、検索結果となった文書をディスプレイ等の出力装置３に表示出力する。図１９は検索結果の例を示す図である。「アメリカの大統領」という検索条件に対して、米国、合衆国、プレジデントなどの検索条件にはない単語を、類義語や同義語などを利用して拡張することなく取得することができる。一方で検索結果から除外される文書の例を図１９の除外文書に示す。
【００４９】
また、検索条件文が「米国のプレジデント」だった場合、キーワード抽出処理では「米国」と「プレジデント」がキーワードとして切り出されるが、上記図９の標準表記データを見ると「米国」は「アメリカ」、「プレジデント」は「大統領」にそれぞれ標準化されるために、最終的な検索条件は［アメリカ］ＡＮＤ［大統領］となり、上記で説明した「アメリカの大統領」を入力して検索したときとまったく同じ検索結果を得ることができる。
【００５０】
以上説明したように、第１の実施の形態によれば、文書の登録時には、登録対象の文書に固有の文書ＩＤを付与し、文書からキーワードとなる単語を抽出し、異表記、異体字、同義語、類義語等の特定の意味を持つ複数の単語を一つまたは複数にまとめる単語群データを参照し、抽出キーワードに標準表記を付与し、標準表記を見出しとし文書ＩＤを登録した検索用データを作成する。また、文書の検索時には、指定された検索条件からキーワードとなる単語を抽出し、単語群データを参照し、抽出キーワードに標準表記を付与し、標準表記を基に検索用データを検索し、検索した結果が検索条件に合致するか否かを判定し、最終的な検索結果を作成する。
【００５１】
これにより、従来の技術で問題になっていた、自然文による検索を行なうときの検索漏れを解消することができる。また、論理式入力時の検索者の負担を軽くすることができるために、操作性を大幅に向上することができる。更に、検索漏れを防ぐために、同義語や類義語によるキーワードの展開を行なわないために、検索処理を最小限に抑えることができ、検索速度の高速化を実現できるという効果を得ることができる。
【００５２】
［第２の実施の形態］
本発明の第２の実施の形態に係る情報処理装置（情報検索装置）は、上記第１の実施の形態と同様に、文書や画像などのマルチメディア情報を管理すると共に、管理しているマルチメディア情報の中から検索者が検索条件を設定し必要とするファイルを取得可能とするものであり、入力装置１、ＣＰＵ２、出力装置３、記憶装置４を備えている。情報処理装置各部の詳細は上述したので説明を省略する。尚、第２の実施の形態においても、本発明の情報検索に関わる動作環境としては、単体の情報処理装置以外に、図２のローカルなネットワーク環境及び図３のインターネット環境にも対応することができる。
【００５３】
第２の実施の形態においては、上記第１の実施の形態で形態素解析した単語をすべてＩＤによって置き換え、情報処理装置（システム）内では単語ＩＤで処理する方法を採るものである。
【００５４】
図２０はキーワードを標準単語ＩＤに直すようにした例を示す図である。図２０によると、「アメリカ」、「米国」、「亜米利加」、「合衆国」、「ＵＳＡ」、「ＡＭＥＲＩＣＡ」はすべて単語ＩＤ＝１として情報処理装置（システム）内で処理される。「米国では大統領選挙の開票を行なった。」を登録した場合、標準表記データを参照して得られる単語ＩＤは、それぞれ、米国＝１、大統領＝２、選挙＝５、開票＝４、行なっ＝３となる。更に、それぞれの単語ＩＤを見出しとして、文書ＩＤ（１１）を追加登録する。その結果、検索用データは図２１に示すようになる。
【００５５】
次に、検索条件「アメリカの大統領」が入力された場合、キーワードとして「アメリカ」と「大統領」が抽出される。図２０の標準単語ＩＤデータを参照してＩＤ＝１とＩＤ＝２を標準単語ＩＤとして取得することができる。これによって、上記検索用データを検索し、検索結果として文書ＩＤ＝｛１，２，３，４，５，１１｝を得ることができ、第１の実施の形態と同様の結果を単語ＩＤによって得ることが可能である。
【００５６】
以上説明したように、第２の実施の形態によれば、情報処理装置（システム）内における単語の扱いは文字列ではなくＩＤとしているために、検索用データや内部処理に要するメモリを節約することができるほか、検索用データの検索処理や、検索用データへの追加登録処理において、高速に処理を実行することが可能となる効果を得ることができる。
【００５７】
［第３の実施の形態］
本発明の第３の実施の形態に係る情報処理装置（情報検索装置）は、上記第１の実施の形態と同様に、文書や画像などのマルチメディア情報を管理すると共に、管理しているマルチメディア情報の中から検索者が検索条件を設定し必要とするファイルを取得可能とするものであり、入力装置１、ＣＰＵ２、出力装置３、記憶装置４を備えている。情報処理装置各部の詳細は上述したので説明を省略する。尚、第３の実施の形態においても、本発明の情報検索に関わる動作環境としては、単体の情報処理装置以外に、図２のローカルなネットワーク環境及び図３のインターネット環境にも対応することができる。
【００５８】
第３の実施の形態においては、上記第１の実施の形態や第２の実施の形態で単語インデックスに登録（格納）するデータは文書ＩＤだけであったものに対し、単語の使用個数（出現個数。出現回数）を文書ＩＤと同時に格納するものである。従って、検索用データには（文書ＩＤ／出現回数）のように格納する。
【００５９】
図２２は上記の（文書ＩＤ／出現回数）のように登録した検索用データを示す図である。例えば、「アメリカ」の文書ＩＤデータにおける１／４は、文書ＩＤ＝１の文書に「アメリカ」は４回出現したことを表している。同様に、文書ＩＤデータから「大統領」は文書ＩＤ＝１の文書には３回出現していることがわかる。このデータを基に、それぞれのキーワードの文書における重要度を公知のｔｆ・ｉｄｆ法などによって計算し、その結果を基に文書の妥当性としてスコアを付けることができる。このことによって、図２３に示す検索結果出力例のように、検索者に対して妥当性が高いと思われる検索結果を高速に提示することが可能となる。
【００６０】
以上説明したように、第３の実施の形態によれば、情報検索処理において高速に妥当性の高い文書から検索者に提示することができるために、検索結果が多数存在した場合においても、検索者の探したい情報をより早く得ることが可能となり、操作性の向上という効果を得ることができる。
【００６１】
［他の実施の形態］
第１乃至第３の実施の形態では、情報処理装置（情報検索装置）で文書の登録及び検索を行なう場合を説明したが、本発明はこれに限定されるものではなく、画像が付随した文書の登録及び検索を行なう場合にも適用することができる。
【００６２】
第１乃至第３の実施の形態では、情報処理装置（情報検索装置）で実行する処理プログラムの供給形態は特定のものに限定されるものではなく、処理プログラムを情報処理装置内部に予め格納しておく構成、処理プログラムを情報処理装置外部から供給する構成のどちらでもよい。
【００６３】
第１乃至第３の実施の形態では、情報処理装置（情報検索装置）を図１に示す構成としたが、本発明はこれに限定されるものではなく、例えば情報処理装置にプリンタ等の印刷装置を接続し、検索結果を出力装置により表示出力すると共に印刷装置により印刷出力してもよい。
【００６４】
また、本発明は、複数の機器から構成されるシステムに適用しても、１つの機器からなる装置に適用してもよい。上述した実施形態の機能を実現するソフトウエアのプログラムコードを記憶した記憶媒体等の媒体をシステム或いは装置に供給し、そのシステム或いは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記憶媒体等の媒体に格納されたプログラムコードを読み出し実行することによっても、本発明が達成されることは言うまでもない。
【００６５】
この場合、記憶媒体等の媒体から読み出されたプログラムコード自体が上述した実施形態の機能を実現することになり、そのプログラムコードを記憶した記憶媒体等の媒体は本発明を構成することになる。
【００６６】
また、プログラムコードを供給するための記憶媒体等の媒体としては、例えば、フロッピー（登録商標）ディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、ＣＤ−ＲＷ、ＤＶＤ−ＲＯＭ、ＤＶＤ−ＲＡＭ、ＤＶＤ−ＲＷ、ＤＶＤ＋ＲＷ、磁気テープ、不揮発性のメモリカード、ＲＯＭ等を用いることができる。
【００６７】
また、コンピュータが読み出したプログラムコードを実行することにより、上述した実施形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼動しているＯＳなどが実際の処理の一部または全部を行ない、その処理によって上述した実施形態の機能が実現される場合も、本発明に含まれることは言うまでもない。
【００６８】
更に、記憶媒体等の媒体から読み出されたプログラムコードが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行ない、その処理によって上述した実施形態の機能が実現される場合も、本発明に含まれることは言うまでもない。
【００６９】
【発明の効果】
以上説明したように、本発明によれば、情報の登録時に、登録対象の情報からキーワードとなる情報を抽出し、抽出したキーワードとなる情報を類似したもの同士でまとめる正規化を行ない、正規化を行なった情報を見出しとした検索用データを作成し、情報の検索時に、検索条件からキーワードとなる情報を抽出し、抽出したキーワードとなる情報を類似したもの同士でまとめる正規化を行ない、正規化を行なった情報を基に検索用データを検索し、検索した内容を検索条件と比較することで妥当性を判定し、検索結果を作成するため、従来の技術で問題になっていた、自然文による検索を行なうときの検索漏れを解消することができる。また、論理式入力時の検索者の負担を軽くすることができるために、操作性を大幅に向上することができる。更に、検索漏れを防ぐために、同義語や類義語によるキーワードの展開を行なわないために、検索処理を最小限に抑えることができ、検索速度の高速化を実現できるという効果を得ることができる。
【００７０】
また、情報検索装置内における単語の扱いは文字列ではなくＩＤとしているために、検索用データや内部処理に要するメモリを節約することができるほか、検索用データの検索処理や、検索用データへの追加登録処理において、高速に処理を実行することが可能となる効果を得ることができる。
【００７１】
また、情報検索処理において高速に妥当性の高い文書から検索者に提示できるために、検索結果が多数存在した場合においても、検索者の探したい情報をより早く得ることが可能となり、操作性の向上という効果を得ることができる。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態に係る情報処理装置の構成を示すブロック図である。
【図２】動作環境の一例を示す図である。
【図３】動作環境の一例を示す図である。
【図４】検索文書登録処理の流れを示すフローチャートである。
【図５】文書検索処理の流れを示すフローチャートである。
【図６】キーワードを標準化する標準表記取得処理の流れを示すフローチャートである。
【図７】文書ＩＤ付与処理の例を示す図である。
【図８】文書解析を行ないキーワードを抽出する文書解析処理の例を示す図である。
【図９】標準表記のデータ例を示す図である。
【図１０】標準表記の取得例を示す図である。
【図１１】登録実行前の登録用データの例を示す図である。
【図１２】検索用データへの登録例を示す図である。
【図１３】登録実行後の検索用データの例を示す図である。
【図１４】検索条件の入力例を示す図である。
【図１５】検索条件からキーワードを抽出する例を示す図である。
【図１６】検索キーワードの標準表記の取得例を示す図である。
【図１７】検索用データの検索例を示す図である。
【図１８】検索結果に対する可否判定の例を示す図である。
【図１９】検索結果の例を示す図である。
【図２０】本発明の第２の実施の形態に係る標準単語ＩＤデータの例を示す図である。
【図２１】検索用データの例を示す図である。
【図２２】本発明の第３の実施の形態に係る検索用データの例を示す図である。
【図２３】検索結果の出力例を示す図である。
【図２４】従来技術１の文書検索処理を示す図である。
【図２５】従来技術２の文書検索処理を示す図である。
【図２６】従来技術３の文書検索処理を示す図である。
【符号の説明】
１　入力装置
２　ＣＰＵ（登録処理手段、検索処理手段、ＩＤ付与手段、登録時単語抽出手段、検索時単語抽出手段、登録時情報付与手段、検索時情報付与手段、検索情報登録手段、検索手段、検索結果可否判定手段）
３　出力装置
４　記憶装置
４−１　処理プログラム（登録処理手段、検索処理手段、ＩＤ付与手段、登録時単語抽出手段、検索時単語抽出手段、登録時情報付与手段、検索時情報付与手段、検索情報登録手段、検索手段、検索結果可否判定手段）
４−２　標準表記データ（単語群情報）
４−３　検索用データ
[0001]
[Technical field of the invention]
The present invention relates to an information search device, an information registration device, an information search method, and a computer-readable program which are applied when a computer or the like searches for information such as a document.
[0002]
[Prior art]
Conventionally, various document search methods have been proposed in which a search condition is input, a search target document is searched based on a keyword, and a search result is obtained.
[0003]
FIG. 24 is a diagram illustrating the document search process of Prior Art 1. The searcher specifies the search condition as a natural sentence. Here, "アメリカの大統領" ("the president of America") is entered as the search query. Keywords are extracted from the input search string by a word segmentation means such as morphological analysis, yielding the keywords "アメリカ" (America) and "大統領" (president). A search condition is then set based on the extracted keywords; here, the search is performed with the condition [アメリカ] AND [大統領]. The pre-registered search target documents are searched under this condition. As a result, document ID = 1 "アメリカ大統領が来日" ("American president visits Japan") and document ID = 2 "アメリカ大統領選挙" ("American presidential election") are obtained as search results. In the figure, the documents listed below the search result documents are the documents excluded from the search results.
[0004]
FIG. 25 is a diagram illustrating the document search process of Prior Art 2. The searcher specifies the search condition as a logical expression of keywords or the like. Here, the following search expression is specified.
[[アメリカ] OR [米国] OR [合衆国]] AND [[大統領]]
In this case, to retrieve documents that contain word notations such as "アメリカ" (America) or "大統領" (president), the searcher must write the logical expression so that no search omissions occur; "米国" (the US) and "合衆国" (the United States) are alternative notations for the same country. The pre-registered search target documents are searched under this condition. As a result, document ID = 1 "アメリカの大統領が来日" ("the president of America visits Japan"), document ID = 2 "アメリカ大統領選挙" ("American presidential election"), document ID = 3 "米国で大統領選挙" ("presidential election in the US"), and document ID = 4 "米国ではプレジデントが最高の" ("in the US, the president [written with the katakana loanword プレジデント] is the highest ...") are obtained as search results. In the figure, the documents listed below the search result documents are the documents excluded from the search results.
[0005]
FIG. 26 is a diagram illustrating the document search process of Prior Art 3. The searcher specifies the search condition as a natural sentence. Here, "アメリカの大統領" ("the president of America") is entered as the search query. Keywords are extracted from the input search string by a word segmentation means such as morphological analysis, yielding the keywords "アメリカ" (America) and "大統領" (president). The extracted keywords are then expanded, either at the searcher's instruction or automatically, using information such as near-synonyms and synonyms. A search condition is set based on the expanded keywords. Here, the search is executed according to the following search condition.
[[アメリカ] OR [米国] OR [合衆国] OR [ＵＳＡ]] AND [[大統領] OR [プレジデント]]
The pre-registered search target documents are searched under this condition. As a result, document ID = 1 "アメリカの大統領が来日" ("the president of America visits Japan"), document ID = 2 "アメリカ大統領選挙" ("American presidential election"), document ID = 3 "米国で大統領選挙" ("presidential election in the US"), document ID = 4 "米国ではプレジデントが最高の" ("in the US, the president [プレジデント] is the highest ..."), and document ID = 5 "初代合衆国大統領" ("first president of the United States") are obtained as search results.
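The difference between Prior Art 1 and Prior Art 3 can be illustrated with a minimal Python sketch. It uses the five example documents in their original Japanese notation, because the distinction between the notations is the whole point. The function name `search` and the use of plain substring matching are assumptions made for brevity; the source matches keywords obtained by morphological analysis.

```python
# Five example documents quoted in the description of FIG. 26.
DOCS = {
    1: "アメリカの大統領が来日",
    2: "アメリカ大統領選挙",
    3: "米国で大統領選挙",
    4: "米国ではプレジデントが最高の",
    5: "初代合衆国大統領",
}


def search(or_groups, docs=DOCS):
    """Return the IDs of documents containing at least one term of every group.

    Each inner list is an OR group; the groups are combined with AND.
    Substring matching is an assumption standing in for keyword matching.
    """
    return [
        doc_id
        for doc_id, text in sorted(docs.items())
        if all(any(term in text for term in group) for group in or_groups)
    ]


# Prior Art 1: [アメリカ] AND [大統領]
prior_art_1 = [["アメリカ"], ["大統領"]]
# Prior Art 3: the same query after synonym expansion
prior_art_3 = [["アメリカ", "米国", "合衆国", "ＵＳＡ"], ["大統領", "プレジデント"]]

print(search(prior_art_1))  # [1, 2]
print(search(prior_art_3))  # [1, 2, 3, 4, 5]
print(sum(len(g) for g in prior_art_1), sum(len(g) for g in prior_art_3))  # 2 6
```

The expanded query recovers all five documents, but it has to look up six terms instead of two, which is the extra cost of keyword expansion.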
[0006]
Examples of such information retrieval techniques include Japanese Patent Application Laid-Open No. H7-65013 and Japanese Patent Application Laid-Open No. H8-255163, both of which fall under Prior Art 3 above.
[0007]
[Problems to be solved by the invention]
However, the prior art described above has the following problems.
[0008]
In Prior Art 1, documents written with "米国" (the US) or "プレジデント" (president, katakana loanword) could not be retrieved for the search query "アメリカの大統領" (see the documents excluded from the search results in FIG. 24). As a result, there were many search omissions, and satisfactory results could not be obtained when the searcher wanted every relevant result or searched for words with many synonyms or notation variants.
[0009]
In Prior Art 2, the searcher has to devise and set the search expression personally. For an exhaustive search, the searcher must therefore construct a complicated search expression, which increases the searcher's burden and makes the method very inconvenient in terms of operability.
[0010]
In Prior Art 3, the keywords are expanded based on information such as synonyms and near-synonyms before the search is performed, so search results with few omissions could be obtained. However, the search speed dropped because the number of search keywords increased.
[0011]
The present invention has been made in view of the above points. Its first object is to provide an information search device, an information registration device, an information search method, and a computer-readable program that eliminate search omissions when searching with natural sentences, greatly improve operability, keep search processing to a minimum, and achieve a higher search speed.
[0012]
A second object of the present invention is to provide an information search device, an information registration device, an information search method, and a computer-readable program that save the memory required for the search data and for internal processing, and that can execute search processing on the search data and additional registration into the search data at high speed.
[0013]
A third object of the present invention is to provide an information search device, an information registration device, an information search method, and a computer-readable program that let the searcher obtain the desired information more quickly even when there are many search results, thereby improving operability.
[0014]
[Means for Solving the Problems]
In order to achieve the above objects, the present invention is an information search device for searching for information, comprising: registration processing means that, when information is registered, extracts information serving as keywords from the information to be registered, performs normalization that groups similar items of the extracted keyword information together, and creates search data whose headings are the normalized information; and search processing means that, when information is searched for, extracts information serving as keywords from a search condition, performs normalization that groups similar items of the extracted keyword information together, searches the search data based on the normalized information, determines validity by comparing the retrieved content with the search condition, and creates a search result.
[0015]
The present invention is also an information search method for searching for information, wherein, when information is registered, information serving as keywords is extracted from the information to be registered, normalization that groups similar items of the extracted keyword information together is performed, and search data whose headings are the normalized information is created; and, when information is searched for, information serving as keywords is extracted from a search condition, normalization that groups similar items of the extracted keyword information together is performed, the search data is searched based on the normalized information, validity is determined by comparing the retrieved content with the search condition, and a search result is created.
[0016]
The present invention is also a computer-readable program that causes a computer to execute: a function of, when information is registered, extracting information serving as keywords from the information to be registered, performing normalization that groups similar items of the extracted keyword information together, and creating search data whose headings are the normalized information; and a function of, when information is searched for, extracting information serving as keywords from a search condition, performing normalization that groups similar items of the extracted keyword information together, searching the search data based on the normalized information, determining validity by comparing the retrieved content with the search condition, and creating a search result.
[0017]
The present invention is also an information registration device comprising registration processing means that extracts information serving as keywords from the information to be registered, performs normalization that groups similar items of the extracted keyword information together, and creates search data whose headings are the normalized information.
[0018]
The present invention is also an information search device comprising search processing means that extracts information serving as keywords from a search condition, performs normalization that groups similar items of the extracted keyword information together, searches the search data based on the normalized information, determines validity by comparing the retrieved content with the search condition, and creates a search result.
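As a rough end-to-end illustration of the registration and search means described above, the following Python sketch normalizes keywords at both registration time and search time, so that the search data is keyed only by the normalized form. The table `CANONICAL`, the function names, and the hand-extracted keyword lists are assumptions for illustration, and the validity determination is reduced to a plain AND over the retrieved document IDs.

```python
from collections import defaultdict

# Hypothetical normalization table: variant notation -> representative form.
CANONICAL = {
    "米国": "アメリカ",
    "合衆国": "アメリカ",
    "ＵＳＡ": "アメリカ",
    "プレジデント": "大統領",
}


def normalize(keyword):
    """Map a keyword to its representative form (itself if it has no entry)."""
    return CANONICAL.get(keyword, keyword)


def register(index, doc_id, keywords):
    """Registration: file the document ID under each normalized keyword."""
    for keyword in keywords:
        index[normalize(keyword)].add(doc_id)


def search(index, keywords):
    """Search: normalize the query keywords, then AND their document ID sets."""
    postings = [index.get(normalize(k), set()) for k in keywords]
    return sorted(set.intersection(*postings)) if postings else []


index = defaultdict(set)
# Keywords extracted by hand from the five prior-art example documents.
register(index, 1, ["アメリカ", "大統領", "来日"])
register(index, 2, ["アメリカ", "大統領", "選挙"])
register(index, 3, ["米国", "大統領", "選挙"])
register(index, 4, ["米国", "プレジデント", "最高"])
register(index, 5, ["初代", "合衆国", "大統領"])

print(search(index, ["アメリカ", "大統領"]))    # [1, 2, 3, 4, 5]
print(search(index, ["米国", "プレジデント"]))  # [1, 2, 3, 4, 5]
```

Both queries perform only two index lookups, yet they return the same five documents, because the variants were folded together when the documents were registered.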
[0019]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0020]
[First Embodiment]
FIG. 1 is a block diagram showing the configuration of an information processing device (information search device) according to the first embodiment of the present invention. The information processing device manages multimedia information such as documents and images, and allows a searcher to set search conditions and obtain the required files from the managed multimedia information. It comprises an input device 1, a CPU 2, an output device 3, and a storage device 4.
[0021]
The input device 1 consists of a keyboard or the like and is used to input search conditions (queries) and so on. The CPU 2 is a central processing unit that controls each part of the information processing device. Using the processing program 4-1 loaded into the storage device 4, it executes the processes shown in the flowcharts of FIG. 4 (search document registration process), FIG. 5 (document search process), and FIG. 6 (standard notation acquisition process), which are described later. The output device 3 consists of a display or the like and displays the search results. The storage device 4 consists of a memory, a hard disk, or the like, and stores the processing program 4-1, the standard word data 4-2, the search data 4-3, and the document ID management information described later.
[0022]
A search condition (query) input from the input device 1, such as a keyboard, is processed by the CPU 2 according to the processing program 4-1 loaded into the storage device 4, such as a memory or a hard disk. The processing program 4-1 refers to the standard word data 4-2, which is also stored in the storage device 4, and converts each extracted keyword into its standard word. It then uses the search data 4-3, which has the standard words as headings and stores document information and the features of the words contained in each document, to calculate and obtain the similarity of the documents to the search condition input from the input device 1, and outputs the search result to the output device 3, such as a display.
[0023]
As the operating environment for the information search of the present invention, in addition to the standalone information processing device (computer) shown in FIG. 1, a local network environment as shown in FIG. 2 and an Internet environment as shown in FIG. 3 can also be supported. The local network environment of FIG. 2 is an example in which a server 21 and a plurality of clients 22, 23, 24, and 25 are connected via a LAN 26; each of the clients 22 to 25 performs information search by using the server 21. The Internet environment of FIG. 3 is an example in which a terminal 31 is connected to the Internet 32; the terminal 31 performs information search by connecting to the Internet 32.
[0024]
Next, the operations related to information search in the information processing device according to the first embodiment configured as described above will be described in detail with reference to FIGS. 4 to 19. FIGS. 4 to 6 are flowcharts showing the flow of each process according to the first embodiment. The processes shown in the flowcharts of FIGS. 4 to 6 are executed by the CPU 2 of the information processing device according to the processing program 4-1 loaded into the storage device 4.
[0025]
<Search document registration process>
First, the search document registration process according to the first embodiment will be described with reference to FIG. 4. FIG. 4 is a flowchart showing the search document registration process. In step S101, the searcher specifies the documents to be registered from the input device 1 of the information processing device. Either a single document or a plurality of documents can be specified. In step S102, the CPU 2 of the information processing device assigns a unique document ID to each document specified in step S101. FIG. 7 is a diagram showing an example of the document ID assignment process. The CPU 2 refers to the document ID management information, which records whether each document ID is in use, and assigns to document A specified in step S101 a document ID that has not yet been allocated.
[0026]
In the example of the first embodiment, document IDs 1 to 10 have already been allocated in the document ID management information of FIG. 7, but document ID = 11 has not, so the CPU 2 of the information processing device assigns the ID 11 to document A. At the same time, the CPU 2 updates the document ID management information by recording in it that ID 11 is now in use. As a result, document A can be handled as document ID 11 within the information processing device (system).
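The ID assignment of step S102 can be sketched as follows. This is only an illustration: representing the document ID management information as a Python set and choosing the smallest unused ID are both assumptions, since the source only requires that the assigned ID be one that is not yet in use.

```python
def assign_document_id(used_ids):
    """Pick a document ID that is not in use, mark it as used, and return it.

    `used_ids` plays the role of the document ID management information.
    Choosing the smallest free ID is an assumption of this sketch.
    """
    new_id = 1
    while new_id in used_ids:
        new_id += 1
    used_ids.add(new_id)  # update the management information at the same time
    return new_id


used = set(range(1, 11))  # IDs 1 to 10 are already allocated, as in FIG. 7
doc_a_id = assign_document_id(used)
print(doc_a_id)  # 11
```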
[0027]
In step S103, the CPU 2 of the information processing device extracts, by morphological analysis or similar processing, the keywords in the document that are to be registered.
[0028]
FIG. 8 is a diagram showing an example of the document analysis process. FIG. 8 illustrates the process of extracting keywords from the sentence to be analyzed in document A, "米国では大統領選挙の開票を行なった。" ("In the US, the votes of the presidential election were counted."). Morphological analysis is performed on this sentence to split it into morphemes, that is, into the morpheme units shown in the word segmentation result of FIG. 8. From these morphemes, the CPU 2 of the information processing device extracts the words that can serve as keywords. Various extraction methods are possible, such as extracting independent words as keywords or deciding by part of speech; in the first embodiment, independent words are extracted as keywords. As a result, the morphemes listed under the extracted keywords in FIG. 8 are extracted as keywords.
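The keyword extraction of step S103 can be sketched in Python as a filter over the morphemes. A real system would obtain the morphemes from a morphological analyzer; here the segmentation and the independent-word flags are written by hand and are assumptions, since FIG. 8 itself is not reproduced in the text.

```python
# Example sentence of document A in its original notation.
SENTENCE = "米国では大統領選挙の開票を行なった。"

# Hand-written segmentation as (surface form, is_independent_word) pairs.
# Both the boundaries and the flags are assumptions of this sketch.
MORPHEMES = [
    ("米国", True), ("で", False), ("は", False),
    ("大統領", True), ("選挙", True), ("の", False),
    ("開票", True), ("を", False),
    ("行なっ", True), ("た", False), ("。", False),
]


def extract_keywords(morphemes):
    """Keep only the independent words, in their order of appearance."""
    return [surface for surface, is_independent in morphemes if is_independent]


# Sanity check: the hand segmentation must reassemble into the sentence.
assert "".join(surface for surface, _ in MORPHEMES) == SENTENCE

keywords = extract_keywords(MORPHEMES)
print(keywords)  # ['米国', '大統領', '選挙', '開票', '行なっ']
```

The verb is still in its inflected form at this point; it is reduced to a standard form in the next step.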
[0029]
In step S104, the CPU 2 of the information processing apparatus acquires the standard notation of each extracted keyword (the notation of a representative word chosen from the words included in the word group) and registers it as a standard word. Details of the standard notation acquisition processing will be described later using the keyword standardization processing of FIG. 6.
[0030]
FIG. 9 is a diagram showing an example list of standard notation data. For many words, there are a plurality of words that express the same meaning, such as inflected forms, different notations (differences among hiragana, katakana, and kanji), notational fluctuations (differences in kana), equivalent terms, and synonyms. The standard notation data is data for converting these words into a single standard notation according to a fixed criterion. For example, according to FIG. 9, "USA", "USA", "USA", "USA", "America", "AMERICA", etc. can all be represented by the standard notation "USA". In addition, the inflected forms of verbs, adjectives, and the like (continuative, terminal, attributive, hypothetical, imperative, and so on) are all represented by the terminal form. In the example of FIG. 9, “do” and “do” are different notations, “America” and “America Rikka” are synonyms, and “president” and “president” are synonyms. "Visit Japan" and "Visit Japan" are synonyms.
[0031]
FIG. 6 is a flowchart showing the standard notation acquisition processing. In step S301, the searcher inputs the keywords extracted in step S103 of FIG. 4 from the input device 1 of the information processing apparatus. In the example sentence of the first embodiment, “USA”, “President”, “election”, “voting”, and “go” are therefore the target keywords. In step S302, the CPU 2 of the information processing apparatus searches the standard notation data for each keyword. In step S303, the CPU 2 acquires the corresponding standard notation from the standard notation data stored in the storage device 4. For a word with no search result (no registered standard notation), the word itself is used as its standard notation.
[0032]
FIG. 10 is a diagram showing an example of acquiring the standard notation from the input notation with reference to the standard notation data of FIG. As a result, "USA", "President", "Election", "Counting", and "Perform" can be acquired as standard notations.
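The standard notation acquisition of steps S301 to S303 amounts to a table lookup that falls back to the input word. The following is a minimal sketch under stated assumptions: the dictionary `STANDARD_NOTATION` holds only the entries that can be read from the description of FIG. 9 ("America" and "AMERICA" mapping to "USA"), and the function name `to_standard_notation` is hypothetical.

```python
# A small stand-in for the standard notation data of FIG. 9.
STANDARD_NOTATION = {
    "USA": "USA",
    "America": "USA",
    "AMERICA": "USA",
}


def to_standard_notation(keyword: str) -> str:
    """Return the standard notation, or the word itself when none is registered."""
    return STANDARD_NOTATION.get(keyword, keyword)


standard_words = [to_standard_notation(k) for k in ["America", "President", "Election"]]
print(standard_words)
```

"President" and "Election" have no entry in this reduced table, so they are returned unchanged, which corresponds to the rule for words with no search result.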
[0033]
Returning to FIG. 4 described above, in step S105, the CPU 2 of the information processing device creates search data with each standard notation as a heading.
[0034]
FIG. 11 is a diagram showing the search data before registration according to the first embodiment. In the search data of the first embodiment, each standard notation serves as a heading under which the IDs of the documents in which it appears are stored. For example, the search data expresses that the word “America” appears in the documents with IDs 1, 2, 3, 4, 5, 8, and 10. When a standard notation to be registered already exists as a heading, the document ID is added to that heading's data. When it does not, new data is created from the heading and the document ID.
[0035]
FIG. 12 is a diagram showing an example of registration in the search data. In FIG. 12, "USA" already exists as a heading in the search data, so the document ID 11 being registered this time is added to the document ID information under the heading "USA". On the other hand, "Counting" does not exist as a heading, so a new heading "Counting" is registered with the document ID 11.
[0036]
FIG. 13 is a diagram showing the search data after registration according to the first embodiment. In FIG. 13, the ID (11) of the document added this time has been stored under the existing headings. In addition, "Counting", which did not exist in FIG. 11, has been registered as a new heading.
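The registration of step S105 behaves like an update of an inverted index: an existing heading gains the new document ID, and a missing heading is created. The sketch below assumes a Python dictionary for the search data and a hypothetical function `register_document`. The posting lists for "USA" and "President" follow the document IDs given in the text; the heading written "Counting" here follows the standard notations listed for FIG. 10 and stands for a heading that does not yet exist.

```python
def register_document(search_data: dict[str, list[int]], doc_id: int,
                      standard_words: list[str]) -> None:
    """Add doc_id under each heading, creating the heading when it is missing."""
    for heading in standard_words:
        doc_ids = search_data.setdefault(heading, [])
        if doc_id not in doc_ids:  # avoid duplicates when a word repeats
            doc_ids.append(doc_id)


# Search data before registration (only two headings are shown here).
search_data = {
    "USA": [1, 2, 3, 4, 5, 8, 10],
    "President": [1, 2, 3, 4, 5, 6, 9],
}
register_document(search_data, 11, ["USA", "President", "Counting"])
print(search_data)
```

After the call, 11 has been appended under "USA" and "President", and "Counting" exists as a new heading holding only 11.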
[0037]
<Document search processing>
Next, the document search process according to the first embodiment will be described with reference to FIG. 5. FIG. 5 is a flowchart showing the document search process. In step S201, the searcher inputs a search condition from the input device 1 of the information processing apparatus as either a natural sentence or a logical expression of keywords. FIG. 14 shows an example of search condition input in the first embodiment. A search can be performed with either the natural sentence "President of the United States" or the logical expression [USA] AND [President]. When the search condition is input as a natural sentence, step S202 is executed.
[0038]
In step S202, the CPU 2 of the information processing apparatus extracts search keywords from the natural sentence search condition input in step S201. FIG. 15 is a diagram illustrating the process of extracting search keywords from the sentence “President of the United States” input as the search condition. Morphological analysis is performed on the search condition sentence to divide it into morpheme units, as represented by the word segmentation results in FIG. 15. The CPU 2 then extracts from the morphemes the words that can serve as keywords. Because the extraction method must be the same as the one used in the document registration process described above, independent words are extracted as keywords, as in the document registration process. As a result, the two morphemes “USA” and “President” listed under the extracted keywords in FIG. 15 are extracted as keywords.
[0039]
In step S203, the CPU 2 of the information processing apparatus acquires the standard notation of each extracted keyword and sets it as a search keyword. Details of the standard notation acquisition processing will be described using the keyword standardization processing of FIG. 6.
[0040]
In step S301 in FIG. 6, the searcher inputs the keywords extracted in step S202 in FIG. 5 from the input device 1 of the information processing apparatus. In the search condition sentence of the first embodiment, "America" and "President" are therefore the target keywords. In step S302, the CPU 2 of the information processing apparatus searches the standard notation data of FIG. 9 for each input keyword. In step S303, the CPU 2 acquires the corresponding standard notation from the standard notation data stored in the storage device 4. For a word with no search result (no registered standard notation), the word itself is used as its standard notation. As a result, as shown in FIG. 16, "USA" and "President" are acquired as the standard notations of the keywords.
[0041]
Next, returning to FIG. 5 described above, in step S204, the CPU 2 of the information processing apparatus creates a search condition based on the standard notations obtained in the standard notation acquisition processing and searches the search data. The search data is composed of a heading area and a data area as shown in FIG. That is, the standard notation is used as the heading information, and the IDs of the documents in which the heading information appears are stored in the data area. FIG. 17 is a diagram showing an example in which "USA" and "President" are each searched for and document ID information is obtained. The search finds that words represented by the standard notation "USA" appear in the documents with IDs 1, 2, 3, 4, 5, 8, 10, and 11. Similarly, it finds that words represented by "President" appear in the documents with IDs 1, 2, 3, 4, 5, 6, 9, and 11.
[0042]
In step S205, the CPU 2 of the information processing apparatus determines whether each result obtained by searching the search data is acceptable as a search result. The input search condition "President of the United States" is ultimately processed as [USA] AND [President]. The condition therefore becomes {1,2,3,4,5,8,10,11} AND {1,2,3,4,5,6,9,11}, and the document IDs satisfying it are determined to be 1, 2, 3, 4, 5, and 11, as shown in FIG. 18. In addition to what is described in the first embodiment, this determination can freely combine and use any information desired as a criterion, such as the part of speech of a word or the relationship between words, provided that the information is registered in the data in advance.
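The AND determination of steps S204 and S205 can be sketched as a set intersection over the document IDs stored under each heading. The dictionary layout and the function name `search_and` are assumptions made for illustration; the document ID sets are the ones given in the text.

```python
def search_and(search_data: dict[str, set[int]], headings: list[str]) -> list[int]:
    """Return the IDs of the documents that appear under every given heading."""
    if not headings:
        return []
    id_sets = [search_data.get(heading, set()) for heading in headings]
    return sorted(set.intersection(*id_sets))


search_data = {
    "USA": {1, 2, 3, 4, 5, 8, 10, 11},
    "President": {1, 2, 3, 4, 5, 6, 9, 11},
}
result = search_and(search_data, ["USA", "President"])
print(result)
```

Documents 8 and 10 lack "President" and documents 6 and 9 lack "USA", so only 1, 2, 3, 4, 5, and 11 remain.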
[0043]
For example, at the time of registration, search data may be created in which the part of speech of each word serving as a keyword is registered along with the document ID, and at the time of search, whether the part of speech of each input word matches may be referred to together with the document ID to determine whether the search result is acceptable.
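One way to picture the part-of-speech variation is to store a (document ID, part of speech) pair under each heading and keep only the entries whose part of speech matches the search word. The storage layout, the function name `search_with_pos`, the part-of-speech labels, and the sample entries (including document ID 12) are all hypothetical and serve only to illustrate the idea.

```python
def search_with_pos(search_data: dict[str, list[tuple[int, str]]],
                    heading: str, pos: str) -> list[int]:
    """Return the document IDs whose registered part of speech matches `pos`."""
    return [doc_id for doc_id, registered_pos in search_data.get(heading, [])
            if registered_pos == pos]


# Hypothetical search data: the same heading registered with different parts of speech.
search_data = {
    "Perform": [(11, "verb"), (12, "noun")],
}
matches = search_with_pos(search_data, "Perform", "verb")
print(matches)
```

Only document 11 is kept, because document 12 registered the word as a noun.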
[0044]
In addition, at the time of registration, search data may be created in which information attached to each keyword, such as ancillary word information and suffix/prefix information, is registered together with the document ID, and at the time of search, the match/mismatch or similarity of the information attached to the keyword may be added to the determination of the validity of the search result.
[0045]
At the time of registration, search data may also be created in which the appearance position of each word serving as a keyword and its appearance state (in a title, a headline, an itemized list, and the like) are registered together with the document ID, and at the time of search, the positional relationship and distance of the keywords, or their appearance state, may be added to the determination of the validity of the search result.
[0046]
Also, at the time of registration, the sentence structure may be extracted and search data created in which information such as the dependency target and dependency source of each word is registered together with the document ID; at the time of search, the dependency relations in the search condition and the dependency relations of the words registered in the search data may then be added to the determination of the validity of the search result.
[0047]
In addition, at the time of registration, any combination or all of the above information (information indicating the part of speech of each keyword; information attached to each keyword, such as ancillary word information and suffix/prefix information; information indicating the appearance position of each keyword and its appearance state in titles, headlines, bullet points, and the like; and information on the dependency target and dependency source of each word, obtained by extracting the sentence structure) may be registered as search data, and at the time of search, all the information included in the search data may be referred to and added to the determination of the validity of the search result.
[0048]
In step S206, the CPU 2 of the information processing apparatus displays the documents obtained as the search result on the output device 3 such as a display. FIG. 19 is a diagram illustrating an example of a search result. For the search condition "American President", documents containing words that do not appear in the search condition, such as the United States, the United States, and the President, can be obtained without expanding the keywords with equivalent terms and synonyms. On the other hand, examples of documents excluded from the search result are shown as the excluded documents in FIG. 19.
[0049]
Likewise, if the search condition sentence is “President of the United States”, “US” and “President” are cut out as keywords in the keyword extraction processing. Because these are standardized to "USA" and “President”, the final search condition is [USA] AND [President], and exactly the same search results are obtained as when “American President” was entered above.
[0050]
As described above, according to the first embodiment, when a document is registered, a unique document ID is assigned to the document, words serving as keywords are extracted from it, word group data that groups into predetermined units a plurality of words having a specific meaning (such as different notations, variant characters, equivalent terms, and synonyms) is referred to, standard notations are assigned to the extracted keywords, and search data is created in which the document ID is registered with each standard notation as a heading. When a document is searched for, words serving as keywords are extracted from the specified search condition, the word group data is referred to, standard notations are assigned to the extracted keywords, the search data is searched based on the standard notations, whether each search result matches the search condition is determined, and a final search result is created.
[0051]
As a result, it is possible to eliminate the search omissions that occur in the related art when searching with a natural sentence. In addition, since the burden on the searcher of inputting a logical expression is reduced, operability can be greatly improved. Furthermore, because keywords do not have to be expanded with equivalent terms or synonyms in order to prevent search omissions, search processing is kept to a minimum and the search speed increases.
[0052]
[Second embodiment]
As in the first embodiment, the information processing apparatus (information retrieval apparatus) according to the second embodiment of the present invention manages multimedia information such as documents and images, allows a searcher to set a search condition and obtain necessary files from the multimedia information, and includes an input device 1, a CPU 2, an output device 3, and a storage device 4. Since the details of each unit of the information processing apparatus have been described above, the description is omitted. In the second embodiment as well, the operating environment for the information search of the present invention can be the local network environment of FIG. 2 or the Internet environment of FIG. 3.
[0053]
In the second embodiment, a method is adopted in which all words subjected to morphological analysis in the first embodiment are replaced with IDs, and processing is performed using word IDs in the information processing apparatus (system).
[0054]
FIG. 20 is a diagram showing an example in which keywords are converted into standard word IDs. According to FIG. 20, "USA", "USA", "America", "USA", "USA", and "AMERICA" are all processed within the information processing apparatus (system) as word ID = 1. When "U.S. voted for presidential election." is registered, the word IDs obtained by referring to the standard notation data are USA = 1, President = 2, Election = 5, Counting = 4, and Perform = 3. The document ID (11) is then additionally registered with each word ID as a heading. As a result, the search data is as shown in FIG. 21.
[0055]
Next, when the search condition “President of the United States” is input, “America” and “President” are extracted as keywords. Referring to the standard word ID data of FIG. 20, ID = 1 and ID = 2 are acquired as their standard word IDs. The search data is then searched with these IDs, and the document IDs {1, 2, 3, 4, 5, 11} are obtained as the search result. The same result as in the first embodiment is thus obtained using word IDs.
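The second embodiment can be sketched by replacing the string headings with integer word IDs. In the sketch below, the tables `WORD_IDS` and `SEARCH_DATA` and the function `search_by_word_ids` are hypothetical names; only the IDs 1 and 2 and the document ID sets come from the text. Returning an empty result for a word with no registered ID is a simplification assumed here, since the text does not describe that case.

```python
# Stand-in for the standard word ID data of FIG. 20 (partial).
WORD_IDS = {"USA": 1, "America": 1, "AMERICA": 1, "President": 2}

# Search data keyed by word ID instead of by character string.
SEARCH_DATA = {
    1: {1, 2, 3, 4, 5, 8, 10, 11},
    2: {1, 2, 3, 4, 5, 6, 9, 11},
}


def search_by_word_ids(keywords: list[str]) -> list[int]:
    """Convert keywords to word IDs, then intersect the document ID sets."""
    word_ids = [WORD_IDS.get(keyword) for keyword in keywords]
    if not word_ids or None in word_ids:
        return []  # assumed behavior for words without a registered ID
    id_sets = [SEARCH_DATA.get(word_id, set()) for word_id in word_ids]
    return sorted(set.intersection(*id_sets))


result = search_by_word_ids(["America", "President"])
print(result)
```

Integer keys keep the index compact and make comparisons cheap, which is the memory and speed benefit this embodiment aims for.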
[0056]
As described above, according to the second embodiment, words are handled within the information processing apparatus (system) as IDs rather than character strings, which reduces the size of the search data and the memory required for internal processing. In addition, both the search of the search data and additional registration to the search data can be executed at high speed.
[0057]
[Third Embodiment]
As in the first embodiment, the information processing apparatus (information retrieval apparatus) according to the third embodiment of the present invention manages multimedia information such as documents and images, allows a searcher to set a search condition and obtain necessary files from the multimedia information, and includes an input device 1, a CPU 2, an output device 3, and a storage device 4. Since the details of each unit of the information processing apparatus have been described above, the description is omitted. In the third embodiment as well, the operating environment for the information search of the present invention can be the local network environment of FIG. 2 or the Internet environment of FIG. 3.
[0058]
In the first and second embodiments, the only data registered (stored) under each word heading is the document ID. In the third embodiment, the number of times the word appears in the document (the number of appearances) is stored together with the document ID. The search data is therefore stored in the form (document ID / appearance count).
[0059]
FIG. 22 is a diagram showing search data registered in the (document ID / number of appearances) form described above. For example, 1/4 in the document ID data of "USA" indicates that "USA" appears four times in the document with document ID = 1. Similarly, the document ID data shows that “President” appears three times in the document with document ID = 1. Based on this data, the importance of each keyword in a document is calculated by a known method such as tf·idf, and a score representing the validity of the document can be assigned from the result. This makes it possible to promptly present to the searcher the search results considered most valid, as in the search result output example shown in FIG. 23.
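The scoring described for the third embodiment can be sketched with one common tf·idf formulation, in which the appearance count serves as tf and log(N / df) serves as idf. The text names tf·idf only as one known method, so the exact formula, the function name `tf_idf_ranking`, the total document count, and the counts for document 2 are assumptions. Only the counts for document 1 (4 for "USA", 3 for "President") come from the text.

```python
import math


def tf_idf_ranking(search_data: dict[str, dict[int, int]], headings: list[str],
                   total_docs: int) -> list[tuple[int, float]]:
    """Score documents by the sum of count * log(total_docs / document frequency)."""
    scores: dict[int, float] = {}
    for heading in headings:
        postings = search_data.get(heading, {})
        if not postings:
            continue
        idf = math.log(total_docs / len(postings))
        for doc_id, count in postings.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + count * idf
    # Highest score first, so the most valid documents are presented first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# (document ID / appearance count) search data; document 2 is hypothetical.
search_data = {
    "USA": {1: 4, 2: 1},
    "President": {1: 3, 2: 2},
}
ranking = tf_idf_ranking(search_data, ["USA", "President"], total_docs=11)
print(ranking)
```

In practice the scoring would be applied only to the documents that passed the determination of step S205; this sketch scores every document found under the given headings.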
[0060]
As described above, according to the third embodiment, search results can be promptly presented to the searcher starting from the most valid documents. Even when a large number of search results exist, the searcher can therefore obtain the desired information more quickly, which improves operability.
[0061]
[Other embodiments]
In the first to third embodiments, the case where a document is registered and searched by the information processing apparatus (information search apparatus) has been described. However, the present invention is not limited to this. The present invention can also be applied to the case of registering and searching.
[0062]
In the first to third embodiments, the form in which the processing program executed by the information processing apparatus (information retrieval apparatus) is supplied is not limited to a specific one; the processing program may be stored in advance in the information processing apparatus, or it may be supplied from outside the information processing apparatus.
[0063]
In the first to third embodiments, the information processing apparatus (information search apparatus) is configured as shown in FIG. 1. However, the present invention is not limited to this. A printing apparatus may be connected to the information processing apparatus so that the search result is displayed by the output device and also printed by the printing apparatus.
[0064]
Further, the present invention may be applied to a system including a plurality of devices or to an apparatus consisting of a single device. It goes without saying that the present invention is also achieved by supplying a medium, such as a storage medium, that stores software program code realizing the functions of the above-described embodiments to a system or an apparatus, and having a computer (or CPU or MPU) of the system or apparatus read and execute the program code stored in the medium.
[0065]
In this case, the program code itself read from a medium such as a storage medium realizes the functions of the above-described embodiments, and the medium such as a storage medium storing the program code constitutes the present invention.
[0066]
Further, as a medium such as a storage medium for supplying the program code, for example, a floppy (registered trademark) disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD + RW, magnetic tape, nonvolatile memory card, ROM, and the like can be used.
[0067]
It goes without saying that the present invention covers not only the case where the functions of the above-described embodiments are realized by the computer executing the read program code, but also the case where an OS or the like running on the computer performs part or all of the actual processing based on the instructions of the program code and the functions of the above-described embodiments are realized by that processing.
[0068]
Furthermore, after the program code read from a medium such as a storage medium is written to a memory provided in a function expansion board or a function expansion unit connected to the computer, based on an instruction of the program code, It is needless to say that the present invention includes a case where a CPU or the like provided in the function expansion board or the function expansion unit performs part or all of the actual processing, and the processing realizes the functions of the above-described embodiments.
[0069]
[Effects of the invention]
As described above, according to the present invention, at the time of registering information, information serving as keywords is extracted from the information to be registered, the extracted keyword information is normalized by grouping similar items together, and search data is created with the normalized information as headings. At the time of searching for information, information serving as keywords is extracted from the search condition, the extracted keyword information is normalized by grouping similar items together, the search data is searched based on the normalized information, the retrieved content is compared with the search condition to determine its validity, and a search result is created. This makes it possible to eliminate search omissions when searching with a natural sentence. In addition, since the burden on the searcher of inputting a logical expression is reduced, operability can be greatly improved. Furthermore, because keywords do not have to be expanded with equivalent terms or synonyms in order to prevent search omissions, search processing is kept to a minimum and the search speed increases.
[0070]
In addition, since words are handled within the information search device as IDs rather than character strings, the size of the search data and the memory required for internal processing can be reduced, and both the search of the search data and additional registration to the search data can be executed at high speed.
[0071]
In addition, since search results can be promptly presented to the searcher starting from the most valid documents, the searcher can obtain the desired information more quickly even when a large number of search results exist, which improves operability.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of an information processing apparatus according to a first embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of an operation environment.
FIG. 3 is a diagram illustrating an example of an operation environment.
FIG. 4 is a flowchart illustrating a flow of a search document registration process.
FIG. 5 is a flowchart illustrating a flow of a document search process.
FIG. 6 is a flowchart showing the flow of a standard notation acquisition process for standardizing a keyword.
FIG. 7 illustrates an example of a document ID assignment process.
FIG. 8 is a diagram illustrating an example of a document analysis process for extracting a keyword by performing a document analysis.
FIG. 9 is a diagram showing an example of data in standard notation.
FIG. 10 is a diagram illustrating an example of acquiring a standard notation.
FIG. 11 is a diagram illustrating an example of registration data before execution of registration.
FIG. 12 is a diagram showing an example of registration in search data.
FIG. 13 is a diagram showing an example of search data after execution of registration.
FIG. 14 is a diagram showing an input example of a search condition.
FIG. 15 is a diagram illustrating an example of extracting a keyword from a search condition.
FIG. 16 is a diagram illustrating an example of obtaining a standard notation of a search keyword.
FIG. 17 is a diagram illustrating a search example of search data.
FIG. 18 is a diagram illustrating an example of a determination as to whether or not a search result is acceptable.
FIG. 19 is a diagram illustrating an example of a search result.
FIG. 20 is a diagram showing an example of standard word ID data according to the second embodiment of the present invention.
FIG. 21 is a diagram illustrating an example of search data.
FIG. 22 is a diagram showing an example of search data according to the third embodiment of the present invention.
FIG. 23 is a diagram showing an output example of a search result.
FIG. 24 is a diagram showing a document search process according to the related art 1.
FIG. 25 is a diagram illustrating a document search process according to the related art 2.
FIG. 26 is a diagram illustrating a document search process according to the conventional technique 3.
[Explanation of symbols]
1 Input device
2 CPU (registration processing means, search processing means, ID assigning means, registration word extracting means, search-time word extracting means, registration-time information providing means, search-time information providing means, search information registration means, search means, search result availability determination means)
3 Output device
4 Storage device
4-1 Processing program (registration processing means, search processing means, ID assigning means, registration word extracting means, search-time word extracting means, registration-time information providing means, search-time information providing means, search information registration means, search means, search result availability determination means)
4-2 Standard notation data (word group information)
4-3 Search data

Claims

An information search device for searching information, comprising:
registration processing means for, at the time of registering information, extracting information serving as a keyword from the information to be registered, normalizing the extracted keyword information by grouping similar items together, and creating search data with the normalized information as a heading; and
search processing means for, at the time of searching for information, extracting information serving as a keyword from a search condition, normalizing the extracted keyword information by grouping similar items together, searching the search data based on the normalized information, comparing the retrieved content with the search condition to determine its validity, and creating a search result.

2. The information search device according to claim 1, wherein the information includes a document or a document accompanied by an image,
the registration processing means includes: ID assigning means for assigning a unique document ID to a document to be registered; registration word extracting means for extracting a word serving as a keyword from the document; registration-time information providing means for referring to word group data, which groups into predetermined units a plurality of words having a specific meaning such as different notations, variant characters, equivalent terms, and synonyms, and providing word group information to the extracted keyword; and search information registration means for creating search data in which the document ID is registered with the word group information as a heading, and
the search processing means includes: search-time word extracting means for extracting a word serving as a keyword from a specified search condition; search-time information providing means for referring to the word group data and providing the word group information to the extracted keyword; search means for searching the search data based on the word group information; and search result availability determination means for determining whether or not the search result matches the search condition and creating a final search result.

3. The information retrieval apparatus according to claim 2, wherein the word group information is information in which a notation of a representative word among words included in the word group is a standard notation.

3. The information retrieval apparatus according to claim 2, wherein the word group information is word group ID representing a word group.

5. The information search device according to claim 2, wherein the search information registration unit creates search data in which appearance number information indicating the number of appearances of the word group information is registered together with the document ID, and the search result availability determination unit determines whether the search result is acceptable by referring to the number of appearances together with the document ID.

The search information registration unit creates search data in which part of speech information indicating the part of speech of each word that has become a keyword is registered together with the document ID, and the search result availability determination unit is configured to search for a word input together with the document ID. The information retrieval apparatus according to claim 2, wherein the determination of the relevance of the search result is performed by referring to the match / mismatch of the parts of speech.

5. The information search apparatus according to claim 2, wherein the search information registering means creates search data in which accompanying information, such as suffix/prefix information attached to each word serving as a keyword, is registered together with the document ID, and the search result availability determining means adds the match/mismatch or similarity of the information attached to the keyword to the determination of the validity of the search result.

5. The information search device according to claim 2, wherein the search information registering unit creates search data in which appearance position information, indicating the appearance position of each word serving as a keyword in titles, headings, bullet points, and the like, is registered together with the document ID, and the search result availability determination unit adds the positional relationship, the distance, and the appearance state of the keywords to the determination of the validity of the search result.

5. The information retrieval apparatus according to claim 2, wherein the search information registering means extracts the sentence structure and creates search data in which word information indicating the dependency target and dependency source of each word is registered together with the document ID, and the search result availability determining means adds the dependency relations in the search condition and the dependency relations of the words registered in the search data to the determination of the validity of the search result.

10. The information search apparatus according to claim 5, wherein the search information registration unit registers any combination or all of the appearance number information, the part of speech information, the accompanying information, the appearance status information, and the word information as search data, and the search result availability determination unit refers to the information included in the search data and adds it to the determination of the validity of the search result.

11. The information retrieval device according to claim 1, wherein the information retrieval device is applicable to a single information processing device, a system in which the information processing device is connected to a local network, or a system in which the information processing device is connected to the Internet.

An information search method for searching for information, comprising:
at the time of registering information, extracting information serving as a keyword from the information to be registered, normalizing the extracted keyword information by grouping similar items together, and creating search data that uses the normalized information as a heading; and
at the time of searching for information, extracting information serving as a keyword from a search condition, normalizing the extracted keyword information by grouping similar items together, searching the search data based on the normalized information, comparing the retrieved content with the search condition to determine validity, and creating a search result.

13. The information search method according to claim 12, wherein the information includes a document or a document accompanied by an image,
at the time of the registration, a unique document ID is assigned to the document to be registered, a word serving as a keyword is extracted from the document, word group data in which a plurality of words having a specific relationship of meaning, such as different notations, variant characters, equivalent terms, or synonyms, are grouped into predetermined units is referred to, word group information is added to the extracted keyword, and search data in which the document ID is registered with the word group information as a heading is created, and
at the time of the search, a word serving as a keyword is extracted from a designated search condition, the word group data is referred to, the word group information is added to the extracted keyword, the search data is searched based on the word group information, whether or not the search result matches the search condition is determined, and a final search result is created.
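
The registration and search steps of this claim can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the word group table, the regex tokenizer standing in for morphological analysis, the class name SearchIndex, and the order-based validity check are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical word group data: variant notation -> standard notation.
WORD_GROUPS = {
    "america": "america", "usa": "america",
    "president": "president", "presidents": "president",
}

def keywords(text):
    """Toy keyword extraction: lowercase tokens that belong to a word group."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t in WORD_GROUPS]

def normalize(words):
    """Attach the standard notation (word group information) to each keyword."""
    return [WORD_GROUPS[w] for w in words]

class SearchIndex:
    def __init__(self):
        self.docs = {}                       # document ID -> original text
        self.postings = defaultdict(set)     # standard notation -> document IDs

    def register(self, text):
        doc_id = len(self.docs) + 1          # unique document ID
        self.docs[doc_id] = text
        for group in normalize(keywords(text)):
            self.postings[group].add(doc_id)
        return doc_id

    def search(self, condition):
        groups = normalize(keywords(condition))
        if not groups:
            return []
        candidates = set.intersection(
            *(self.postings.get(g, set()) for g in groups)
        )
        # Assumed validity check: keep a candidate only if its normalized
        # keywords contain the query's normalized keywords in the same order.
        return sorted(d for d in candidates if self._matches(d, groups))

    def _matches(self, doc_id, groups):
        remaining = iter(normalize(keywords(self.docs[doc_id])))
        return all(g in remaining for g in groups)   # subsequence test
```

Because both the documents and the search condition are mapped to the same standard notation, a query written as "USA president" also reaches documents that say "America president", and the final check discards candidates that merely contain the same words in a different arrangement.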

14. The information search method according to claim 13, wherein the word group information is information in which a notation of a representative word among words included in the word group is a standard notation.

14. The information search method according to claim 13, wherein the word group information is information that is a word group ID representing a word group.

16. The information search method according to claim 13, wherein, at the time of the registration, search data is created in which appearance number information indicating the number of appearances of the word group information is registered together with the document ID, and at the time of the search, the number of appearances is referred to together with the document ID to determine whether the search result is acceptable.
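
One way to picture the appearance number information is a posting list that stores a count next to each document ID. The sketch below is only an assumption about how such a check might look: the function names and the min_count threshold are hypothetical, and the claim does not state how the number of appearances is weighed.

```python
from collections import Counter, defaultdict

def build_count_index(docs):
    """docs: {document ID: list of normalized keywords (word group information)}."""
    index = defaultdict(dict)                # word group -> {document ID: count}
    for doc_id, groups in docs.items():
        for group, n in Counter(groups).items():
            index[group][doc_id] = n
    return index

def search_with_counts(index, query_groups, min_count=2):
    """Accept a document only if every query group appears at least min_count times.

    min_count is an illustrative threshold, not a value taken from the claim.
    """
    accepted = None
    for group in query_groups:
        hits = {d for d, n in index.get(group, {}).items() if n >= min_count}
        accepted = hits if accepted is None else accepted & hits
    return sorted(accepted or [])
```

A document that mentions each query word group only once is then treated as an incidental match and dropped, while lowering min_count to 1 reduces the check to a plain AND search.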

16. The information search method according to claim 13, wherein, at the time of the registration, search data is created in which part of speech information indicating the part of speech of each word that has become a keyword is registered together with the document ID, and at the time of the search, a match / mismatch with the part of speech of the input word is referred to together with the document ID to determine whether the search result is acceptable.

16. The information search method according to claim 13, wherein, at the time of the registration, search data is created in which accompanying information indicating attached word information, suffix / prefix information, and the like associated with each word that has become a keyword is registered together with the document ID, and a match / mismatch or similarity of the accompanying information is added to the determination of the validity of the search result.

16. The information search method according to claim 13, wherein, at the time of the registration, search data is created in which appearance status information, indicating appearance positions and appearance states in titles, headings, bullet points, and the like, is registered together with the document ID, and the positional relationship, the distance, and the appearance state are added to the determination of the validity of the search result.
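
The use of appearance position and appearance state in the validity determination can be sketched as follows. The Occurrence record, the state labels, and the max_distance rule are hypothetical choices for illustration; the claim names the factors (positional relationship, distance, appearance state) but not how they are combined.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Occurrence:
    doc_id: int
    position: int    # token offset of the keyword within the document
    state: str       # assumed labels: "title", "heading", "bullet", "body"

def valid_documents(occ_a, occ_b, max_distance=3):
    """Return IDs of documents in which two keywords co-occur closely.

    A document is accepted when some occurrence of keyword A and some
    occurrence of keyword B share the same appearance state and lie within
    max_distance tokens of each other (an illustrative rule).
    """
    return sorted({
        a.doc_id
        for a in occ_a
        for b in occ_b
        if a.doc_id == b.doc_id
        and a.state == b.state
        and abs(a.position - b.position) <= max_distance
    })
```

Under this rule, two keywords that sit side by side in a title pass, whereas a document where one keyword appears in the title and the other far down in the body is rejected.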

16. The information search method according to claim 13, wherein, at the time of the registration, the sentence structure is extracted and search data is created in which word information indicating the dependency source and dependency destination of each word is registered together with the document ID, and at the time of the search, the dependency relations in the search condition and the dependency relations between the words registered in the search data are added to the determination of the validity of the search result.
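
The dependency-based check can be pictured as a comparison of (source word, destination word) pairs. In the sketch below the pairs are written by hand; in practice they would come from a dependency parser, which is outside this example. The function name and the subset rule are assumptions, not the method fixed by the claim.

```python
def satisfies_dependencies(query_edges, doc_edges):
    """Return True if every dependency in the search condition also occurs in the document.

    Each edge is a (dependency source, dependency destination) pair of
    normalized words. Requiring a subset match is an illustrative rule.
    """
    return frozenset(query_edges) <= frozenset(doc_edges)

# Hand-written edges for the query "president of America":
# "america" modifies "president".
QUERY = [("america", "president")]

# "America's president visits Japan"
DOC_1 = [("america", "president"), ("president", "visit"), ("japan", "visit")]

# "The president of a company in America"
DOC_2 = [("company", "president"), ("america", "company")]
```

Both documents contain the words "america" and "president", but only the first has "america" depending directly on "president", so only the first would survive this check.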

21. The information search method according to claim 16, wherein, at the time of the registration, any combination or all of the appearance number information, the part of speech information, the accompanying information, the appearance status information, and the word information are registered as search data, and at the time of the search, the information included in the search data is referred to and added to the determination of the validity of the search result.

A computer-readable program for causing a computer to execute: a function of, at the time of registering information, extracting information serving as a keyword from the information to be registered, normalizing the extracted keyword information by grouping similar items together, and creating search data that uses the normalized information as a heading; and a function of, at the time of searching for information, extracting information serving as a keyword from a search condition, normalizing the extracted keyword information by grouping similar items together, searching the search data based on the normalized information, comparing the retrieved content with the search condition to determine validity, and creating a search result.

An information registration device comprising registration processing means that extracts information serving as a keyword from information to be registered, normalizes the extracted keyword information by grouping similar items together, and creates search data that uses the normalized information as a heading.

An information search device comprising a search processing unit that extracts information serving as a keyword from a search condition, normalizes the extracted keyword information by grouping similar items together, searches the search data based on the normalized information, compares the retrieved content with the search condition to determine validity, and creates a search result.