JP2004506960A

JP2004506960A - Probability Matching Engine

Info

Publication number: JP2004506960A
Application number: JP2001564037A
Authority: JP
Inventors: ジャロ，マシュー・エイ
Original assignee: バリティー・テクノロジー・インコーポレーテッド
Priority date: 2000-02-28
Filing date: 2001-02-28
Publication date: 2004-03-04
Also published as: WO2001065416A3; CA2401170A1; WO2001065416A2; AU2001243337A1

Abstract

A method and apparatus allow information to be retrieved from an electronic database using a probabilistic approach together with some query processing. In one aspect, records of the database are parsed into record tokens using a pattern action language before an index of the records is generated. In another aspect, a table of index tokens is generated; the table includes the frequency of occurrence in the database of each index token, and each index token comprises a phonetic equivalent of its respective record token. In one aspect, a query is parsed into query tokens using a pattern action language, search tokens are generated from the query tokens, and the search tokens are used to access database records. In another aspect, a search token comprises a phonetic equivalent of a reference token, or of a token that qualifies as similar to the reference token, and the search tokens are used to access database records. Whether a token qualifies as similar to a reference token is determined by comparing the reference token with a database dictionary using an information-theoretic algorithm.
In yet another aspect, a token is selected, the selected token is used to access database records, a likelihood of relevance to the query is calculated for each record, and the highest likelihood of relevance is compared with a continuation threshold. If the continuation threshold is exceeded, no further records are accessed and the accessed records are output. If the continuation threshold is not exceeded, the selected search token is removed from the set of available search tokens and a new token is selected to access database records.

Description

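[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates generally to techniques for retrieving database information. In particular, it relates to retrieving database information based on record linkage theory with query expansion.
[0002]
BACKGROUND OF THE INVENTION
Although the distinction is not always clear-cut, information retrieval has traditionally been classified as taking one of two forms: browsing or querying. Browsing is generally the more passive of the two. It lets the user reach part of a database through a simple mechanism (for example, a menu of topics) and then examine the accessed information by navigating through it, often with several levels of guidance from the information retrieval system. Hypertext generally supports the browsing approach to information retrieval. Although it demands little of the user, browsing is not necessarily the most efficient way to retrieve information from a large database.
[0003]
Querying, in contrast to browsing, requires the user to specify the information of interest. A query succeeds only when the information of interest is specified in a way that matches the language of the database, and achieving that match often requires compromise in the choice of query terms. Querying can be perceived as burdensome, particularly by untrained users, and it can also produce poor results. Querying has itself traditionally been classified as taking one of two forms: querying for Boolean retrieval or querying for probabilistic retrieval.
[0004]
Querying for Boolean retrieval is the most established form of information retrieval. It requires the user to construct an appropriate combination of terms that matches both the information of interest and the language of the database. Boolean searching requires the user to specify only a limited number of terms in order to obtain an acceptable number of results. Optimal Boolean searching requires the user to be familiar with Boolean operators and with efficient combinations of terms. Nevertheless, users rarely use Boolean operators explicitly.
[0005]
Querying for probabilistic retrieval offers the user a broader search. Retrieved results are generally compared with the query terms using a probabilistic algorithm and ranked by how closely they match those terms. Terms that occur less frequently in the database are considered more discriminating and are given more weight in predicting a match. The user is not constrained in the number of query terms used, because ranking the results mitigates the problem of excessive results.
[0006]
Nevertheless, problems remain when an electronic database is queried using probabilistic retrieval. Misspellings and spelling variants that are not accounted for, in either the query or the database, can cause relevant information to be overlooked during retrieval. Likewise, information formats that are not accounted for in the query or the database can cause relevant information to be overlooked. When retrieval speed matters, specifying many query terms can result in an unacceptably slow response. A user may also become frustrated at having to wait for results because the database is tied up with other searches. Conversely, a user may become frustrated when a fast search returns poor results.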
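[0007]
SUMMARY OF THE INVENTION
In one aspect, the invention includes a method of indexing a database. Records of the database are input. Each record is parsed into record tokens using a pattern action language. An index of the records is generated from the record tokens of each record.
[0008]
In one embodiment, parsing includes converting each record into original tokens, identifying each original token, and converting the identified original tokens into record tokens based on the pattern action language. In a related embodiment, the pattern action language is responsive to the domain with which the record tokens are associated.
[0009]
In another embodiment, generating the index includes generating a list of unique record tokens from the record tokens of each record, calculating a frequency of occurrence in the database for each unique record token, and generating a table of index tokens. The table of index tokens includes the frequency of occurrence in the database of each unique index token. In a related embodiment, each index token comprises a phonetic equivalent of its respective record token. In a further related embodiment, a list of unique record tokens is also generated.
[0010]
In another aspect, the invention includes a method of indexing a database. Records of the database are input, and each record is parsed into record tokens. An index token is generated from each record token; the index token is a phonetic equivalent of the record token. A frequency of occurrence in the database is calculated for each unique index token. A table of index tokens is generated that includes the frequency of occurrence of each unique index token.
[0011]
In one embodiment, a list of unique record tokens is also generated. In a related embodiment, each record is parsed into record tokens using a pattern action language. In another related embodiment, parsing includes converting each record into original tokens, identifying each original token, and converting the identified original tokens into record tokens based on the pattern action language.
[0012]
In certain embodiments of the foregoing aspects, the index of the database is generated from the record tokens of each record. Each record token is associated with a domain in the database, the pattern action language is responsive to the domain, the frequency of occurrence is calculated with respect to the domain in the database, and the index of unique record tokens lists the frequency of occurrence by domain.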
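[0013]
In one aspect, the invention relates to an apparatus for indexing a database. An input device accepts records of the database. A parser parses the records into record tokens, and an indexer generates an index of the record tokens in the database.
[0014]
In one embodiment, the parser includes a tokenizer, a token identifier, and a token converter. The tokenizer converts a record into original tokens. The token identifier identifies each original token, and the token converter converts the identified original tokens into record tokens based on a pattern action language. In a related embodiment, the pattern action language is responsive to the domain with which the record tokens are associated.
[0015]
In another embodiment, the indexer includes a token comparator, a frequency calculator, and a table generator. The token comparator generates a list of unique index tokens from the record tokens. The frequency calculator calculates the frequency of occurrence in the database of each unique index token. The table generator generates a table that includes the frequency of occurrence of each unique index token. In a related embodiment, each index token is a phonetic equivalent of its respective record token, and the token comparator communicates with the parser through a token generator. In another embodiment, a record token comparator also generates a list of unique record tokens.
[0016]
In another aspect, the invention relates to an apparatus for indexing a database. An input device accepts records of the database, and a parser parses the records into record tokens. A token generator generates an index token from each record token; each index token is a phonetic equivalent of its respective record token. For each index token, a table generator generates a table that includes the frequency of occurrence of the index token in the database, as calculated by a frequency calculator, and pointers to all records that contain the index token.
[0017]
In one embodiment, a record token comparator generates a list of unique record tokens from the record tokens of each record. In a related embodiment, the table generator generates a table that includes a pointer to each record in the database that contains the index token corresponding to a unique index token. In another related embodiment, a record token comparator in communication with the parser also generates a list of unique record tokens.
[0018]
In a further related embodiment, the parser parses each record using a pattern action language. In one such embodiment, the parser further includes a tokenizer, a token identifier, and a token converter. The tokenizer converts a record into original tokens. The token identifier identifies each original token, and the token converter converts the identified original tokens into record tokens based on the pattern action language.
[0019]
In certain embodiments, an original token, its respective record token, and its respective index token are all associated with the same domain in the database, pattern recognition is responsive to the domain with which the tokens are associated, and the frequency of occurrence of the index tokens is calculated by domain. In one embodiment, the table generator generates a table that includes the frequency of occurrence of each unique index token and a pointer to each record in the database that contains the corresponding record token.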
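[0020]
In one aspect, the invention relates to a method of querying a database. A query is input and parsed into query tokens using a pattern action language. Search tokens are generated from the query tokens. The search tokens are looked up in an index table to access records in the database.
[0021]
In one embodiment, parsing includes converting the query into original tokens, identifying each original token, and converting the identified original tokens into query tokens based on the pattern action language. In a related embodiment, an original token and the resulting query token are associated with the same domain in the database. In a further related embodiment, the pattern action language is responsive to the domain with which the tokens are associated.
[0022]
In yet another related embodiment, the search tokens are generated from the query tokens, and each search token is associated with the domain in the database with which its respective query token is associated.
[0023]
In another aspect, the invention relates to a method of querying a database. A query is input and parsed into query tokens. Search tokens are generated from the query tokens. Generating the search tokens includes checking a list of unique record tokens for tokens that are similar to a query token according to an information-theoretic algorithm, and translating the query tokens and the similar tokens into search tokens. Each search token is a phonetic equivalent of a query token or of a similar token. The search tokens are looked up in an index table to access records in the database.
[0024]
In one embodiment, each search token is associated with the same domain in the database as its respective query token. In a related embodiment, parsing is performed using a pattern action language. In a further related embodiment, parsing includes converting the query into original tokens, identifying each original token, and converting the identified original tokens into query tokens based on the pattern action language.
[0025]
In one aspect, the invention relates to an apparatus for querying a database. A query input device accepts a query as input. A parser parses the input into query tokens using a pattern action language. A generator generates search tokens from the query tokens. A database accessor accesses records in the database in response to the search tokens.
[0026]
In one embodiment, the parser includes a tokenizer, an identifier, and a converter. The tokenizer generates original tokens from the input, and the identifier identifies each of them. The converter converts the identified original tokens into query tokens based on the pattern action language. In a related embodiment, an original token is associated with the same domain in the database as its respective query token and search token. In another related embodiment, the tokens are associated with the same domain in the database, and the pattern action language is responsive to that domain.
[0027]
In another aspect, the invention relates to an apparatus for querying a database. A query input device accepts input, and a parser parses it into query tokens. A generator generates search tokens from the query tokens. The generator includes a query expander that adds tokens that qualify as similar to a query token according to an information-theoretic algorithm; these are the similar tokens. The generator also includes a translator that translates each query token and similar token into a search token that is its phonetic equivalent. A database accessor uses the search tokens to locate pointers to the corresponding records in the database.
[0028]
In one embodiment, each query token, its respective similar tokens, and its respective search tokens are associated with the same domain in the database. In another embodiment, the parser parses using a pattern action language. In a related embodiment, the parser includes a tokenizer, an identifier, and a converter.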
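[0029]
In one aspect, the invention relates to a method of accessing data in a database. A token is selected from a set as the first token to be searched. A set of records is retrieved from the database in response to the selected token. A likelihood of relevance to the query is determined for each record in the set, and the records are ordered by that likelihood. The highest likelihood of relevance in the set is compared with a continuation threshold. If the threshold is exceeded, the search is terminated and the set of records is output. Otherwise, a different token is selected for a new search.
[0030]
In one embodiment, the likelihood of relevance to the query is determined based on record linkage theory. In a related embodiment, the set of records consists of more than one record, and the output records are ordered by their likelihood of relevance to the query.
[0031]
In another embodiment, a frequency of occurrence in the database is determined for each token, the tokens are ordered by frequency of occurrence, and the token with the lowest frequency of occurrence is selected as the first search token. If the continuation threshold is not exceeded, the token with the next-lowest frequency of occurrence is selected as the next search token. In a related embodiment, the frequency of occurrence is determined with respect to a domain in the database, and each token is associated with a domain. In such an embodiment, the tokens are ordered, and the token with the lowest frequency of occurrence within its associated domain is the first token selected. In a related embodiment, the likelihood of relevance to the query is determined for each record based on record linkage theory. In a further related embodiment, if a buffer of retrieved records overflows, the buffer is cleared and a new search is started for records that contain all of the tokens.
[0032]
In another aspect, the invention relates to an apparatus for accessing data in a database. A database accessor retrieves a set of records from the database in response to a token selected by a token selector as the first token to be searched. A relevance determiner determines a likelihood of relevance to the query for each record in the set. A relevance comparator orders the records in the set by likelihood of relevance, and a threshold comparator compares the highest likelihood of relevance with a continuation threshold. If the continuation threshold is exceeded, the threshold comparator terminates the search. Otherwise, the threshold comparator eliminates the selected token and causes the token selector to select another token. An output device returns the set of records when the threshold comparator terminates the search.
[0033]
In one embodiment, the likelihood of relevance to the query is determined based on record linkage theory. In a related embodiment, the database accessor retrieves more than one record, and the output device returns the records ordered by likelihood of relevance to the query.
[0034]
In another embodiment, a frequency comparator determines the frequency of occurrence in the database of each token and orders the tokens by frequency of occurrence. The token selector selects the token with the lowest frequency of occurrence as the first token to be searched. In a related embodiment, the frequency comparator determines the frequency of occurrence within the domain in the database with which each token is associated, and the token with the lowest frequency of occurrence within its associated domain is selected as the first token. In another related embodiment, the relevance determiner determines the likelihood of relevance to the query based on record linkage theory. In a further related embodiment, a buffer overflow guard clears the buffer when it overflows and sends an overflow signal to the token selector. The database accessor then retrieves from the database the set of records that contain all of the tokens.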
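[0035]
DETAILED DESCRIPTION OF THE INVENTION
In brief, the invention relates to an information retrieval process for an electronic database, as shown in FIG. 1. The information retrieval process is the process by which a query is used to access existing reference data in a database. In the invention, probability theory is used to select and retrieve records in the database according to a user's query. The information retrieval process can generally be divided into the three steps shown in FIG. 1: indexing the reference data, processing the query, and accessing the reference data. The last two steps can also be regarded as the search phase.
[0036]
A database generally contains many records, each of which can be referenced by a record number. Each record generally contains several domains, and each domain in turn contains several fields. Each field may contain free-form text. For example, an Internal Revenue Service database may contain a separate record for each taxpayer. The taxpayer records may be numbered and may contain separate domains for the taxpayer's home and work addresses. Each address domain may contain a street field, a town field, and a postal code field. The street field may accept free-form text such as "10910 Way Through the Woods" or "71 Camino de Grasica". A database generally does not require every field or domain to contain information. For example, a taxpayer who works as a freelance photographer may have no work address, in which case the taxpayer's work address domain contains no data.
[0037]
Other database arrangements are possible, and the information retrieval process of the invention can readily be adapted to them. Nevertheless, for the purposes of this application, the reference data in the database is assumed to comprise many records, each record comprising many domains, each domain comprising one or more fields, and each field comprising free-form text. Given this assumed arrangement of the reference data, the invention operates on the free-form text residing in the fields.
[0038]
In brief, as shown in the block diagram of FIG. 1, the first step in the information retrieval process is indexing the reference data (step 10). Indexing can be regarded as preparing the reference data for the search phase of the information retrieval process. FIG. 1A depicts the evolution of a record during indexing according to one embodiment of the invention. When indexing begins, the elements 42 of each record 44 are parsed into a set of record tokens (T_Rn) 46. In some embodiments, parsing removes certain portions of the record and standardizes the remainder. In the embodiment shown in FIG. 1A, index tokens (T_In) 62 are generated from the record tokens (T_Rn) 46.
[0039]
As indexing completes, the index tokens (T_In) 62 and the record tokens (T_Rn) 46 are analyzed to facilitate subsequent searching. In one embodiment, a list of the unique record tokens (T_Rn) 46 contained in the reference data is generated. In one embodiment, a table 96 of unique index tokens (T_In) 62 is generated. In a related embodiment, the table 96 includes the frequency of occurrence (ν_n) in the database of each unique index token (T_In) 62. In another related embodiment, the table 96 includes pointers 94 to the records in the reference data that contain the token. In the embodiment shown in FIG. 1A, there is a single comprehensive table that contains all of the available index information. In other embodiments, there are multiple tables, each containing part of the available index information.
[0040]
The second step in the information retrieval process shown in FIG. 1 is processing the query (step 20). Processing the query can be regarded as preparing the query for use in the information access phase of the information retrieval process. FIG. 1B depicts the evolution of a query 54 during query processing according to one embodiment of the invention. The elements 52 of the query 54 are parsed into a set of query tokens (T_Qn) 56. In the embodiment shown in FIG. 1B, parsing removes certain portions of the query 54 and standardizes the remainder. In one embodiment, any token from the list of record tokens (T_Rn) 46 that qualifies as similar to a query token (T_Qn) 56 according to an information-theoretic algorithm is added to the set of query tokens. In one embodiment, search tokens (T_Sn) 72, which can be used to access records in the reference data, are generated from the query tokens (T_Qn) 56 and the similar tokens. In certain embodiments, the processing of the query corresponds to the processing of the records in the reference data.
[0041]
The third step in the information retrieval process shown in FIG. 1 is accessing the reference data (step 30). Accessing the reference data is the culmination of the preparation of the reference data and the query. FIG. 1C depicts the access process according to an embodiment of the invention. In one embodiment, consistent with a probabilistic search model, a search token (T_Sn) 72 is selected from the set of search tokens based on its selectivity. Records 44 that contain the search token (T_Sn) 72 are retrieved from the reference data using the token table 96. In one embodiment, a weight representing the likelihood that the record is relevant to the user's query 54 is calculated for each record. In a related embodiment, the weight calculation is based on record linkage theory. In one embodiment, the highest weight in the set of retrieved records is compared with a threshold to decide whether to continue or terminate the search. In one embodiment, the retrieved records are returned to the user in order. In certain embodiments, the weight of each record is returned to the user, either alone or together with the record. At the end of the information retrieval process, the user has a list of records and, in certain embodiments, weights with which to assess the relevance of each record to the query.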
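[0042]
Refer to FIG. 2, which shows a detailed block diagram of the process of indexing the reference data according to an embodiment of the invention. The first step is to parse one record 44 of the reference data (step 40). Parsing a record into tokens includes separating the data in the record into a set of tokens. In certain embodiments, the developer of the reference data defines a set of characters that is used as the basis for separating the contents of a record into tokens. In some such embodiments, the developer-defined characters are used alone. In other such embodiments, they are used together with default characters. In other embodiments, the developer lets the default characters serve as the sole basis for separation. A group of characters may be used as a basis for separation. In certain embodiments, the separator characters themselves become tokens.
[0043]
For example, a record containing "big;bad.wolf and red riding hood" becomes "<big>;<bad>.<wolf and red riding hood>" where the semicolon and the period are defined as the separator characters and "<" and ">" mark token boundaries. Similarly, the same record becomes "<big;bad.wolf> <and> <red> <riding> <hood>" where the space is defined as the separator character, with "<" and ">" again marking token boundaries. In other embodiments, the separator characters are removed during separation. In certain embodiments, different characters are used as the basis for separation in different fields or domains.
[0044]
In certain embodiments, parsing a record includes removing certain tokens. In certain embodiments, the developer defines a set of tokens to be removed after the contents of the record have been separated into tokens. In one embodiment, the developer-defined tokens are the only tokens removed. In another embodiment, they are removed together with default tokens. In other embodiments, the developer simply lets the default tokens be removed. A removed token need not consist of a single character. For example, "<big><;><bad><.><wolf and red riding hood>" becomes "<big><bad><wolf and red riding hood>" where the semicolon and the period are defined as tokens to be removed. In certain embodiments, the developer defines different tokens to be removed in different fields or domains.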
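The separation and removal steps described above can be illustrated with a minimal Python sketch. The function name `separate`, its parameters, and the use of a regular expression are assumptions made for illustration; the specification does not prescribe an implementation.

```python
import re

def separate(text, separators, keep_separators=True, remove=()):
    """Split `text` at every character in `separators` (hypothetical helper).

    keep_separators: if True, each separator character becomes its own token.
    remove: tokens to drop after separation (a "removed token" list).
    """
    separator_set = set(separators)
    removed = set(remove)
    # A capturing group makes re.split return the separators as well.
    pattern = "([" + re.escape(separators) + "])"
    tokens = [part for part in re.split(pattern, text) if part]
    if not keep_separators:
        tokens = [t for t in tokens if t not in separator_set]
    return [t for t in tokens if t not in removed]

record = "big;bad.wolf and red riding hood"
print(separate(record, ";."))
print(separate(record, ";.", remove=[";", "."]))
print(separate(record, " ", keep_separators=False))
```

Passing a different `separators` or `remove` argument per field or domain would correspond to the embodiments in which the separation basis varies by field or domain.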
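[0045]
In certain embodiments, parsing a record includes examining the set of tokens produced by separation for patterns and acting on one or more tokens in a recognized pattern. In such embodiments, the attributes of each token are determined once the record has been separated into tokens. In one embodiment, the attributes include type, length, value, abbreviation, and substring. In other embodiments, additional attributes are determined. In still other embodiments, fewer attributes are determined. In any embodiment, determining one attribute of a token may make it unnecessary to determine another. In one embodiment that uses the type attribute, the types are: numeric; alphabetic; leading numeric, in which one or more letters follow leading digits; leading alphabetic, in which one or more digits follow leading letters; complex mix, which covers any mixture of digits and letters that fits neither of the preceding two types; and special, which covers special characters not generally encountered. In another embodiment that uses the type attribute, other types are determined. In one embodiment, the alphabetic type is case sensitive. In certain embodiments, additional developer-defined types are used in combination with the default types. In other embodiments, developer-defined types are used to the exclusion of the default types. For example, in one embodiment in which alphabetic tokens are case sensitive, the token <aBcdef> has the attributes of alphabetic type, a length of six characters, and the value "aBcdef".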
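As a hedged illustration of the type, length, and value attributes, the following Python sketch classifies a token into the six types listed above. The function names, the type labels, and the restriction to ASCII letters and digits are assumptions of this sketch, not details from the specification.

```python
import re

def token_type(value):
    """Classify a token into one of the six types listed in the text."""
    if re.fullmatch(r"[0-9]+", value):
        return "numeric"
    if re.fullmatch(r"[A-Za-z]+", value):
        return "alphabetic"
    if re.fullmatch(r"[0-9]+[A-Za-z]+", value):
        return "leading numeric"      # leading digits followed by letters
    if re.fullmatch(r"[A-Za-z]+[0-9]+", value):
        return "leading alphabetic"   # leading letters followed by digits
    if re.fullmatch(r"[0-9A-Za-z]+", value):
        return "complex mix"          # any other mixture of letters and digits
    return "special"                  # contains some other character

def attributes(value):
    """Return the type, length, and (case-preserved) value of a token."""
    return {"type": token_type(value), "length": len(value), "value": value}

print(attributes("aBcdef"))
```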
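[0046]
In embodiments in which parsing includes modifying recognized patterns of tokens in some way, the patterns to be recognized must be defined in terms of the available token attributes. In some such embodiments, a pattern is defined to operate only under conditions that occur within a particular domain. In other such embodiments, a pattern is defined to operate under conditions that occur within any of a set of domains. Pattern matching begins with the first token and proceeds one token at a time. A single record may match multiple patterns. A pattern may be defined by any attribute of a token, of part of a token, or of a set of tokens. In one embodiment, a pattern is defined by the attributes of a single token. In another embodiment, a pattern is defined by the attributes of a set of tokens. For example, in one embodiment a pattern is defined as a token whose length is less than 10, whose first four characters are the substring "ANTI", and which contains the substring "CS" at any position. In this example, the tokens <ANTICS> and <ANTI-CSAR> would be recognized for action. In contrast, the token <ANTIPATHY> would not be recognized because it does not satisfy the second substring constraint, and the token <ANTIPOLITICS> would not be recognized because it does not satisfy the length constraint.
[0047]
In embodiments in which parsing includes modifying recognized patterns of tokens in some way, a number of actions may be taken to modify a pattern. In one embodiment, the action taken in response to a recognized pattern is to change one of the attributes of the pattern. In another embodiment, the action is to concatenate portions of the pattern. In yet another embodiment, the action is to print debugging information. In other embodiments, other actions are taken. Certain embodiments take one action on one substring of one token. Certain embodiments take several actions in response to one recognized pattern. For example, in one embodiment the command "SET the value of <token> to (1:2)<token>" is defined to be executed upon recognition of a pattern consisting of an alphabetic token of length 7 whose first two characters are the substring "EX". In this example, the token <EXAMPLE> is recognized as fitting the pattern, and executing the command changes the value of the token to the first two characters of the original token, that is, to "EX". In other embodiments, the value of noise words, such as "at", "by", and "on", which are generally not useful for searching, is set to zero so that they are excluded from the list of unique index tokens. As shown in FIG. 1A, parsing converts a database record 44 into record tokens (T_Rn) 46.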
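The two worked examples above (the "ANTI"/"CS" pattern and the "EX" action) can be sketched in Python as plain predicate and action functions. This is not the pattern action language itself, whose syntax the text shows only through a single command; all names below are hypothetical, and dropping noise words outright stands in for setting their value to zero.

```python
def is_anti_cs(token):
    """Pattern: length under 10, starts with "ANTI", contains "CS" anywhere."""
    return len(token) < 10 and token.startswith("ANTI") and "CS" in token

def is_ex7(token):
    """Pattern: alphabetic token of length 7 whose first two characters are "EX"."""
    return token.isalpha() and len(token) == 7 and token.startswith("EX")

def keep_first_two(token):
    """Action: SET the value of <token> to (1:2)<token>."""
    return token[:2]

NOISE_WORDS = {"at", "by", "on"}  # example noise words from the text

def apply_rules(tokens, rules):
    """Apply (pattern, action) pairs to each token, one token at a time."""
    result = []
    for token in tokens:
        if token.lower() in NOISE_WORDS:
            continue  # assumption: a zero-valued noise word is simply omitted
        for pattern, action in rules:
            if pattern(token):
                token = action(token)
        result.append(token)
    return result

print([is_anti_cs(t) for t in ["ANTICS", "ANTI-CSAR", "ANTIPATHY", "ANTIPOLITICS"]])
print(apply_rules(["EXAMPLE", "at", "HOME"], [(is_ex7, keep_first_two)]))
```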
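[0048]
The second step in the process of indexing the reference data shown in FIG. 2 is to identify the unique record tokens (step 50). Identifying the unique record tokens produces a list of unique record tokens. Such a list may be described as a dictionary of the terms in the database. In one embodiment, certain fields are excluded from contributing to the list. In another embodiment, certain domains are excluded from contributing to the list. In one embodiment, tokens are excluded from contributing to the list of unique tokens based on their type. In another embodiment, tokens are excluded based on their type and another attribute. In certain embodiments, the excluded types or other attributes are specified by domain. In certain embodiments, they are specified for the record as a whole. For example, in one embodiment the developer excludes from the list of unique tokens all numeric tokens of five or more characters. In another embodiment, step 50 is skipped. In yet another embodiment, step 50 is performed after the process of indexing the reference data shown in FIG. 2.
[0049]
The third step in the process of indexing the reference data shown in FIG. 2 is to generate index tokens (T_In) 62 from the record tokens (T_Rn) 46 (step 60). Step 60 is also shown in FIG. 1A. In certain embodiments, the index tokens are the record tokens themselves. In such embodiments, step 70 duplicates step 50. In other embodiments, as shown in FIG. 1A, the index tokens (T_In) 62 are phonetic equivalents of the record tokens (T_Rn) 46. In these embodiments, the index tokens are generally generated by translating the record tokens into a phonetic language. In one such embodiment, the phonetic language is NYSIIS. In another such embodiment, it is Soundex. In still other such embodiments, the phonetic equivalence is based on another phonetic language or a variant thereof. In one embodiment, there are multiple sets of index tokens, each based on a different phonetic language or variant. In one embodiment, only record tokens of alphabetic type are translated, and record tokens of other types are not used to generate index tokens. In another embodiment, record tokens of alphabetic and other types generate index tokens, but only the alphabetic characters of the record tokens are translated into the index tokens.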
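To make the notion of a phonetic equivalent concrete, the following Python sketch implements the widely used American Soundex rules. The text names Soundex only as one possible phonetic language and does not say which variant is intended, so the choice of variant is an assumption of this sketch.

```python
# Letter-to-digit codes of American Soundex; vowels, H, W, and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word):
    """Return the four-character Soundex code of `word` ("" if no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets `previous`, allowing a repeated code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Smith"), soundex("Smyth"), soundex("Robert"))
```

Because "Smith" and "Smyth" translate to the same code, a query containing one spelling would reach records indexed under the other.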
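[0050]
The fourth step in the process of indexing the reference data shown in FIG. 2 is to identify the unique index tokens (step 70). Step 70 is similar in many respects to step 50. Identifying the unique index tokens produces a list of unique index tokens. Such a list may be described as a dictionary of index terms. In one embodiment, certain fields are excluded from contributing to the list. In another embodiment, certain domains are excluded from contributing to the list. In one embodiment, tokens are excluded from contributing to the list of unique tokens based on their type. In another embodiment, tokens are excluded based on their type and another attribute. In certain embodiments, the excluded types and other attributes are specified by domain. In certain embodiments, they are specified for the record as a whole. For example, in one embodiment the developer excludes from the list of unique tokens all alphabetic tokens of five or more characters. In another embodiment, step 70 is skipped. In yet another embodiment, step 70 is performed after step 80. In another embodiment, step 70 is performed as part of step 80.
[0051]
The fifth step in the process of indexing the reference data shown in FIG. 2 is to check for additional records (step 80). This step is simply a check that determines when it is appropriate to calculate the frequency of occurrence of the index tokens. If there are additional records, the next record is processed before this step is repeated. If there are no additional records, indexing continues to step 90. In one embodiment, checking for additional records simply comprises looking for an end-of-file flag.
[0052]
The sixth and final step in the process of indexing the reference data shown in FIG. 2 is to calculate the frequency of occurrence of the tokens in the database (step 90). The frequency of occurrence is also known as the collection frequency or document frequency. Assuming that tokens are independent, a lower frequency of occurrence indicates a more selective token. Tokens are not necessarily independent; for example, a phrase containing a particular group of tokens may occur repeatedly in the database. Nevertheless, token independence is an acceptable approximation of reality. The frequency of occurrence may be calculated for any type of token that can be associated with a record. For example, in one embodiment it is calculated for index tokens. It may also be calculated for several different types of token that can be associated with a record; for example, in another embodiment it is calculated for index tokens and record tokens.
[0053]
In one embodiment, the frequency of occurrence is calculated for each unique index token with respect to the database as a whole. In another embodiment, it is calculated for each unique index token with respect to each domain in the database. Other levels of specificity are also possible. In certain embodiments, no frequency of occurrence is calculated for certain unique index tokens. In one embodiment, such index tokens include noise words such as "the" and "and". Generating the list of index tokens while calculating their frequencies of occurrence makes the frequency calculation more efficient.
[0054]
When the frequencies of occurrence are calculated, it is efficient to generate and store a token table 96 that includes pointers 94 to the records containing each token, at their respective locations in the database. The table 96 avoids the need for repeated searches for the records that contain a token. In one embodiment, as shown in FIG. 1A, the pointers 94 are included in the comprehensive table 96. In another embodiment, the pointers are included in a separate table and associated with their respective tokens.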
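A token table of the kind described above can be sketched as a dictionary that maps each unique index token to its frequency of occurrence and to pointers to the records that contain it. In this sketch the frequency is counted as the number of records containing the token (a document frequency), record numbers serve as pointers, and `str.upper` is a placeholder for the phonetic translation; all three are assumptions made for illustration.

```python
from collections import defaultdict

def build_token_table(records, to_index_token=str.upper):
    """Build {index_token: {"frequency": n, "pointers": [record numbers]}}.

    records: dict mapping a record number to its list of record tokens.
    to_index_token: translation from record token to index token
                    (placeholder; a phonetic coder could be passed instead).
    """
    table = defaultdict(lambda: {"frequency": 0, "pointers": []})
    for record_number, record_tokens in records.items():
        # Each record counts once per unique index token it contains.
        for index_token in {to_index_token(t) for t in record_tokens}:
            entry = table[index_token]
            entry["frequency"] += 1
            entry["pointers"].append(record_number)
    return dict(table)

records = {1: ["john", "smith"], 2: ["jane", "smith", "smith"], 3: ["john", "doe"]}
table = build_token_table(records)
print(table["SMITH"])
```

Keeping the pointers alongside the frequencies corresponds to the single comprehensive table of FIG. 1A; splitting them into two dictionaries would correspond to the separate-table embodiment.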
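[0055]
Refer to FIG. 3, which shows a block diagram of query processing according to one embodiment. The first step in processing the query shown in FIG. 3 is to parse the query 54 (step 40). The query can be parsed using the same process, and the same variations, used to parse a record from the database (step 40). The only difference is that parsing a record 44 yields record tokens (T_Rn) 46, whereas parsing a query 54 yields query tokens (T_Qn) 56, as shown in FIG. 1B.
[0056]
The second step in processing the query shown in FIG. 3 is to expand the query 54 (step 90). In certain embodiments, the query is expanded by adding similar tokens to the query tokens. In one such embodiment, the similar tokens are selected from the list of unique record tokens. Various comparisons between a query token and a candidate record token may be considered in deciding which tokens from the list to add to the query tokens. For ease of understanding, the list of unique record tokens may be thought of as a dictionary of the terms in the database, and the comparisons between query tokens and candidate record tokens as a spelling check of the query. In one embodiment, the following comparisons are considered: the number of mismatched characters, the number of transpositions, and the lengths of the strings. In another embodiment, a subset of these comparisons is considered. In yet another embodiment, other comparisons are considered instead of, or in addition to, those specified.
[0057]
In certain embodiments, the tokens entered in the list of unique record tokens are used for comparison with the query token. In other embodiments, a smaller subset of the tokens in the list is used. For example, in one such embodiment, only the subset of record tokens whose first two characters are the same as those of the query token is compared with that query token, which serves as the reference token. In this example, if the list of unique record tokens contains no record token whose first two characters are the same as those of the query token <XENITH>, no further comparison is made and no record token is added to the set of reference tokens for <XENITH>. In embodiments that expand the query by comparing candidate record tokens with the query token, a threshold is set to determine which candidate record tokens are added to the set of reference tokens and which are not. In certain embodiments, the threshold is based on the similarity of the candidate record token to the query token. In one such embodiment, the threshold is the minimum similarity required for a candidate record token to be included. In other embodiments, the threshold is based on the dissimilarity of the candidate record token to the query token. In one such embodiment, the threshold is the maximum dissimilarity beyond which a candidate record token is excluded. In another embodiment, the threshold is a combination of similarity and dissimilarity.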
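The candidate selection described above (restrict the dictionary to tokens sharing the query token's first two characters, then apply a similarity threshold) can be sketched as follows. As a stand-in for the specification's own similarity measure, the sketch uses `difflib.SequenceMatcher` from the Python standard library; that substitution, the function name, and the default threshold of 0.75 are all assumptions.

```python
from difflib import SequenceMatcher

def similar_tokens(query_token, dictionary, min_similarity=0.75):
    """Return dictionary tokens that qualify as similar to `query_token`.

    Only tokens whose first two characters match those of the query token
    are compared; the rest of the dictionary is skipped.
    """
    prefix = query_token[:2]
    candidates = [t for t in dictionary if t[:2] == prefix and t != query_token]
    return [t for t in candidates
            if SequenceMatcher(None, query_token, t).ratio() >= min_similarity]

dictionary = ["SMITH", "SMYTH", "SMART", "JONES"]
print(similar_tokens("SMITH", dictionary))
print(similar_tokens("XENITH", dictionary))
```

Restricting comparison to a prefix bucket trades recall for speed: a misspelling in the first two characters would not be recovered by this particular subset rule.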
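[0058]
Various calculations of similarity and dissimilarity are possible, depending on the comparisons made between the query token and the record token. Similarity may be calculated as follows, where each S is a weighting factor, c is the number of characters common to the reference token and the candidate record token, d is the length of the reference token, r is the length of the candidate record token, and t_r is the number of character transpositions found by comparing the query token with the candidate record token.
(1) similarity = S_cd * (c / d) + S_rd * (c / r) + S_tr * ((c - t_r) / c)
Of the weighting factors S, S_cd weights the percentage of characters in the query token that are common to the candidate record token, S_rd weights the percentage of characters in the candidate record token that are common to the query token, and S_tr weights the percentage of characters common to both tokens that are not transposed. In one embodiment, all of the similarity weighting factors are set to the value 300, and a candidate record token is added to the set of query tokens if its calculated similarity exceeds the minimum similarity.
[0059]
Dissimilarity may be calculated as follows, where each D is a weighting factor, u_cd is the number of characters in the reference token that are not in the candidate record token, d is the length of the reference token, u_rd is the number of characters in the candidate record token that are not in the reference token, r is the length of the candidate record token, t_r is the number of character transpositions found by comparing the query token with the candidate record token, and c is the number of characters common to the reference token and the candidate record token.
(2) dissimilarity = D_cd * (u_cd / d) + D_rd * (u_rd / r) + D_tr * (t_r / c)
Of the weighting factors D, D_cd is the penalty factor for the percentage of characters in the reference token that are not in the candidate record token, D_rd is the penalty factor for the percentage of characters in the candidate record token that are not in the reference token, and D_tr is the penalty factor for the percentage of characters common to both tokens that are transposed.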
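Equations (1) and (2) can be evaluated with the short Python sketch below. The text does not say how common characters or transpositions are counted, so the sketch assumes a multiset intersection for c, takes u_cd = d - c and u_rd = r - c, and leaves the transposition count t_r as a caller-supplied argument. The default weights (1/3 for each S, 1.0 for each D) are illustrative choices, not values from the specification.

```python
from collections import Counter

def common_chars(a, b):
    """Assumed definition of c: size of the multiset intersection of characters."""
    return sum((Counter(a) & Counter(b)).values())

def similarity(query, candidate, t_r=0, s_cd=1/3, s_rd=1/3, s_tr=1/3):
    """Equation (1); returns 0.0 when the tokens share no characters."""
    c = common_chars(query, candidate)
    if c == 0:
        return 0.0
    d, r = len(query), len(candidate)
    return s_cd * (c / d) + s_rd * (c / r) + s_tr * ((c - t_r) / c)

def dissimilarity(query, candidate, t_r=0, d_cd=1.0, d_rd=1.0, d_tr=1.0):
    """Equation (2); both tokens are assumed to be non-empty."""
    c = common_chars(query, candidate)
    d, r = len(query), len(candidate)
    transposed = d_tr * (t_r / c) if c else 0.0
    return d_cd * ((d - c) / d) + d_rd * ((r - c) / r) + transposed

print(round(similarity("SMITH", "SMYTH"), 4))
print(round(dissimilarity("SMITH", "SMYTH"), 4))
```

With equal weights summing to one, identical tokens score a similarity of 1.0, which makes a minimum-similarity threshold easy to interpret as a fraction.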
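[0060]
In one embodiment, the query is expanded by generating search tokens (T_Sn) 72 from the query tokens (T_Qn) 56 and the similar tokens. The search tokens can be generated using the same process, and the same variations, used to generate index tokens from record tokens (step 60). The only difference is that index tokens (T_In) 62 are generated from record tokens (T_Rn) 46, whereas search tokens (T_Sn) 72 are generated from query tokens (T_Qn) 56.
[0061]
In another embodiment, as shown in FIG. 1B, the query is expanded solely by generating search tokens (T_Sn) 72 from the query tokens (T_Qn) 56. Again, the search tokens can be generated using the same process, and the same variations, used to generate index tokens from record tokens (step 60). Again, the only difference is that index tokens (T_In) 62 are generated from record tokens (T_Rn) 46, whereas search tokens (T_Sn) 72 are generated from query tokens (T_Qn) 56.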
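[0062]
Refer to FIG. 4, which shows a block diagram of the process of accessing the reference data according to one embodiment. The first step in accessing the reference data shown in FIG. 4 is to select a first search token (step 100). In one embodiment, the first search token is selected at random from the search tokens. In another embodiment, it is selected according to a fixed order within the search tokens. In certain embodiments, the first search token is the most selective search token. In certain embodiments, the search tokens are ordered by selectivity. In one such embodiment, selectivity is determined by frequency of occurrence within the indexed set of database records. In another such embodiment, selectivity is determined by frequency of occurrence within a particular domain of the indexed set of database records. In another such embodiment, selectivity is determined by frequency of occurrence within a particular field of a domain of the indexed set of database records. In one embodiment, the first search token is the most selective search token within the domain corresponding to the domain specified by the query. In another embodiment, the most selective search token is identified by comparing the frequencies of occurrence reported in the table of unique index tokens.
[0063]
The second step in accessing the reference data shown in FIG. 4 is to access the reference data (step 110). In certain embodiments, a new search of the set of database records is started for the selected token. In other embodiments, once the first search token has been selected, it is looked up in the token table. In one such embodiment, as shown in FIG. 1C, the token table 96 directly returns a set of pointers 94 to the records in the database that contain the selected token (T_S3) 72. In another such embodiment, the token table 96 indirectly returns a set of pointers 94 to the records in the database that contain the selected token. The pointers may be used to access the records in the database.
[0064]
The third step in accessing the reference data shown in FIG. 4 is to calculate relevance (step 120). In certain embodiments, each accessed record is evaluated by calculating a weight that represents its likelihood of relevance to the query. In some such embodiments, the weight is calculated by comparing the query tokens with the record tokens. In another such embodiment, the weight is calculated by comparing the query tokens with the record tokens within the domain specified by the query.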
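[0065]
Record linkage is the process of examining records and locating pairs of records that match on some combination of fields. Record linkage theory is the probabilistic basis for deciding whether a pair of records match, or are relevant to, each other. In some embodiments, the invention applies this theory to match a query against individual records within the set of database records. The query is defined as a record from a set A of records. A record from the reference data that is a candidate to match the query is defined as a record from a set B of records. Each pair of records comprises one record from set A, in effect the query, and one record from set B. Each pair of records is a member of either the set M of matched pairs or the set U of unmatched pairs.
[0066]
Under record linkage theory, the ability of a field to identify a match depends on the selectivity of the field's contents and on their accuracy. Selectivity is a measure of the ability of a field's contents to discriminate between records. For example, where the field is a surname, the token <Humperdinck> is more selective than the token <Smith>, because far more records are likely to contain <Smith> than <Humperdinck> in the surname field. The selectivity u_i is defined as the probability that two records have the same contents in a field when the pair of records is a member of the set U of unmatched pairs. Mathematically: u_i = P(field_agrees | ρ ∈ U).
[0067]
Accuracy is a measure of the reliability of the data in a field. For example, field information that was entered more carefully, or checked after entry, is more likely to agree in a matched pair than field information that was entered less carefully or not checked after entry. The accuracy m_i is defined as the probability that two records have the same contents in a field when the pair of records is a member of the set M of matched pairs. Writing P(α | β) for the probability that α is true given a condition β, this is expressed as: m_i = P(field_agrees | ρ ∈ M).
[0068]
These measures are quantified and applied mathematically to predict the likelihood that a record in the reference data is of interest to the user, based on the user's query. Consider a pair of records in which the first record is from set A and the second is from set B. A and B share some common fields. Each pair of records ρ is a member of either the set M of matched pairs or the set U of unmatched pairs. For each pair of records ρ and each domain i common to both sets of records, the following quantities are defined. The agreement weight W_A is the logarithm of the ratio of the accuracy m_i to the selectivity u_i.
(3) W_A = log_2(m_i / u_i)
In certain embodiments, the agreement weight W_A is added to the likelihood of relevance of a candidate record when the candidate record contains a token equal to a query token in the respective domain i. In other embodiments, the agreement weight W_A is added when the candidate record contains a token equal to a query token in the respective field i. In other embodiments, i denotes another level of specificity of data location.
[0069]
The disagreement weight W_D is the logarithm of the ratio of one minus the accuracy m_i to one minus the selectivity u_i.
(4) W_D = log_2((1 - m_i) / (1 - u_i))
In certain embodiments, the disagreement weight W_D is subtracted from the likelihood of relevance of a candidate record when the candidate record does not contain a token equal to a query token in the respective domain i. In other embodiments, the disagreement weight W_D is subtracted when the candidate record does not contain a token equal to a query token in the respective field i. In other embodiments, i denotes another level of specificity of data location.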
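Equations (3) and (4) translate directly into code. The sketch below also sums the weights over several fields to obtain a record's relevance weight. The probabilities used are invented for illustration, and the sign convention is an assumption: the sketch adds the signed value of W_D, which is negative whenever m_i exceeds u_i and therefore lowers the total, whereas the text speaks of subtracting the disagreement weight.

```python
import math

def agreement_weight(m, u):
    """Equation (3): W_A = log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Equation (4): W_D = log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))

def relevance_weight(field_probs, agrees):
    """Sum W_A over agreeing fields and signed W_D over disagreeing fields.

    field_probs: {field: (m_i, u_i)}, with 0 < u_i < 1 and 0 < m_i < 1.
    agrees: {field: True/False}; a missing field counts as a disagreement.
    """
    total = 0.0
    for field, (m, u) in field_probs.items():
        if agrees.get(field, False):
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)  # negative when m > u
    return total

# Hypothetical probabilities: surnames are far more selective than towns.
probs = {"surname": (0.9, 0.01), "town": (0.8, 0.2)}
print(round(relevance_weight(probs, {"surname": True, "town": False}), 3))
```

A rare value (small u_i) in a reliable field (large m_i) yields a large agreement weight, which is why an agreeing <Humperdinck> contributes more than an agreeing <Smith>.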
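[0070]
In certain embodiments, an adjacency weight is added to the likelihood-of-relevance weight of a candidate record if the candidate record contains more than one token equal to more than one query token and the relevant tokens are immediately adjacent to each other. In certain embodiments, a semi-adjacency weight is added to the likelihood-of-relevance weight of a candidate record if the candidate record contains more than one token equal to more than one query token and the relevant tokens are located near each other. In one embodiment, the semi-adjacency weight is added if the search tokens are separated by one intervening token. In other embodiments, the semi-adjacency weight is added if the search tokens are separated by more than one intervening token. In one embodiment, the adjacency and semi-adjacency weights are weighting factors applied to the relevant search tokens. Various weighting schemes for proximity are possible. In one embodiment, the likelihood of relevance of a candidate record is calculated by summing the agreement weights W_A, the adjacency weights, and the semi-adjacency weights for all record tokens in the candidate record that relate to the query tokens. In one example, the semi-adjacency weight is added only when exactly one token lies between the record tokens in the candidate record that are equal to query tokens.
[0071]
The fourth step in accessing the reference data shown in FIG. 4 is to compare the calculated relevance with a threshold (step 130). In certain embodiments, the weight of each accessed record is compared with one or more thresholds. In other embodiments, the candidate records are ordered by their likelihood-of-relevance weight, so that the weights of the set of accessed records can be compared more efficiently with the one or more thresholds.
[0072]
In certain embodiments, the weights are compared with a continuation threshold. In such embodiments, the search terminates when the continuation threshold is exceeded, and at that point all of the accessed records are output. If the continuation threshold is not exceeded, a different search is triggered (step 140). The token used as the basis for the previous search is removed from the set of available search tokens. The first step in the new search is to select a different token with which to access the reference data. In such embodiments, if the most selective token has already been used to access the data, the second most selective token is used for the subsequent search. The process repeats until the continuation threshold is exceeded or all of the search tokens have been used to access the data.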
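The loop described above (select the most selective remaining token, retrieve, score, compare with the continuation threshold, repeat) can be sketched as follows. The shape of `token_table`, the `score` callback, and the decision to accumulate accessed records across successive searches are assumptions of this sketch; the text leaves those details open.

```python
def search(search_tokens, token_table, score, continuation_threshold):
    """Access records token by token until the continuation threshold is exceeded.

    token_table: {token: {"frequency": int, "pointers": [record numbers]}}
    score: callable mapping a record number to its relevance weight.
    Returns (record number, weight) pairs ordered by decreasing weight.
    """
    # Lower frequency of occurrence means a more selective token.
    available = sorted((t for t in set(search_tokens) if t in token_table),
                       key=lambda t: (token_table[t]["frequency"], t))
    accessed = {}
    for token in available:  # each token is used once, then discarded
        for pointer in token_table[token]["pointers"]:
            if pointer not in accessed:
                accessed[pointer] = score(pointer)
        if accessed and max(accessed.values()) > continuation_threshold:
            break  # threshold exceeded: stop accessing further records
    return sorted(accessed.items(), key=lambda item: item[1], reverse=True)

table = {"S530": {"frequency": 2, "pointers": [1, 2]},
         "J500": {"frequency": 1, "pointers": [3]}}
weights = {1: 9.0, 2: 4.0, 3: 2.0}  # hypothetical precomputed record weights
print(search(["S530", "J500"], table, weights.get, continuation_threshold=5.0))
```

Starting with the rarest token keeps the first retrieval small; when that retrieval already contains a strong match, the more common tokens are never looked up.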
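[0073]
In certain embodiments, the weights of the accessed records are compared with a presentation threshold. In such embodiments, only some of the accessed records are output: the output is limited to records whose likelihood-of-relevance weight exceeds the presentation threshold.
[0074]
In certain embodiments, the highest possible likelihood-of-relevance weight is calculated for each query. It depends on the weighting scheme selected. In certain embodiments, the developer chooses to have additional tokens reduce the weight of a candidate record. For example, in an embodiment that uses only the agreement weight W_A, the highest possible weight is the weight a candidate record would have if it contained every query token in the respective domain. Likewise, in an embodiment that uses the agreement weight W_A and the adjacency weight, the highest possible weight is the weight a candidate record would have if it contained every query token in the respective domain and in the same arrangement as the query.
[0075]
In certain embodiments, the continuation threshold used as the basis for terminating the search is a percentage of the highest possible weight. In other embodiments, the continuation threshold is an absolute weight. In certain embodiments, the presentation threshold used as the criterion for presenting records accessed in the search is a percentage of the highest possible weight. In other embodiments, the presentation threshold is an absolute weight.
[0076]
In certain embodiments, the accessed records are ordered for output by their likelihood-of-relevance weight. In other embodiments, they are output in the order in which they were retrieved. In still other embodiments, they are output in some other order.
[0077]
Certain embodiments include a step in the process of accessing the database that is not shown in the embodiment of FIG. 4. In this step, the amount of information accessed is compared with an overflow threshold. In such embodiments, if the overflow threshold is exceeded, the current search is terminated and the memory or buffer is cleared. In one such embodiment, a new search is triggered. The new search is based on all of the search tokens joined by a Boolean AND. If the overflow threshold has triggered a new search, the continuation threshold is thereafter disabled. Otherwise, the records accessed in the new search are processed in the same way as in the usual search. In certain embodiments, the overflow threshold used as the basis for terminating the search and triggering a different search is akin to a software error or a warning relating to the available memory or buffer space.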
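The overflow fallback can be sketched as a set intersection over the pointer lists of all search tokens. In this simplified sketch the token table maps a token directly to its list of record numbers, and the overflow threshold is expressed as a maximum number of retrieved records; both simplifications, and the function name, are assumptions.

```python
def search_with_overflow(selected_token, all_tokens, pointer_table, overflow_threshold):
    """Return record numbers for `selected_token`, or fall back to Boolean AND.

    pointer_table: {token: [record numbers]} (simplified token table).
    If the selected token retrieves more records than `overflow_threshold`,
    the result is discarded and only records containing every search token
    are returned.
    """
    pointers = pointer_table.get(selected_token, [])
    if len(pointers) <= overflow_threshold:
        return sorted(pointers)
    # Overflow: clear the result and AND all of the search tokens together.
    pointer_sets = [set(pointer_table.get(t, [])) for t in all_tokens]
    return sorted(set.intersection(*pointer_sets)) if pointer_sets else []

pointer_table = {"SMITH": [1, 2, 3, 4], "JOHN": [2, 4, 9]}
print(search_with_overflow("SMITH", ["SMITH", "JOHN"], pointer_table, overflow_threshold=3))
```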
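[0078]
Finally, in one embodiment, the developer chooses to run, for each query, a new search based on all of the search tokens joined by a Boolean AND, in addition to the usual search.
[0079]
Having described embodiments of the invention, it will be apparent to those skilled in the art that other embodiments incorporating the concepts disclosed herein may be practiced without departing from the spirit and scope of the invention. The embodiments described are to be considered in all respects as illustrative and not restrictive. Accordingly, the scope of the invention is to be construed as limited only by the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[FIG. 1]
FIG. 1 is a functional block diagram of an information retrieval process known in the prior art.
FIG. 1A is a diagram of the evolution of a record during indexing according to an embodiment of the invention.
FIG. 1B is a diagram of the evolution of a query during query processing according to an embodiment of the invention.
FIG. 1C is a diagram showing the interaction between the search tokens of a query and the records during information access according to an embodiment of the invention.
[FIG. 2]
A functional block diagram of the information indexing portion of the information retrieval process according to an embodiment of the invention.
[FIG. 3]
A functional block diagram of the query processing portion of the information retrieval process according to an embodiment of the invention.
[FIG. 4]
A functional block diagram of the information access portion of the information retrieval process according to an embodiment of the invention.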
TECHNICAL FIELD OF THE INVENTION
The present invention generally relates to techniques for searching database information. In particular, the present invention relates to retrieving database information based on record linkage theory using query expansion.
[0002]
BACKGROUND OF THE INVENTION
Although the distinctions are not always clearly separated, information retrieval has traditionally been classified as belonging to one of two forms: browsing or querying. Browsing is generally more passive than referrals. Browsing often involves several stages of information retrieval system guidance by having the user access the database through a simple mechanism (eg, a menu topic) and then navigating through it. Let the user investigate the accessed information. Hypertext generally supports a browsing approach to information retrieval. Browsing is not always the most efficient technique for retrieving information from large databases, albeit with less demand on the user.
[0003]
Queries, compared to browsing, must allow the user to specify information that is of interest to him. Queries succeed only when the information of interest is specified in a way that matches the database language. This match often requires a compromise in selecting query terms. The query can be perceived as burdening the user, especially if the user is not trained. Queries can also result in poor search results. The query itself is traditionally categorized as belonging to one of two forms: a query for a Boolean search, or a query for a probabilistic retrieve.
[0004]
Queries about Boolean searches are information searches with the most established form. It requires the user to create an appropriate combination of words that matches both the information of interest and the database language. Boolean searching requires the user to specify only a finite number of terms in order to obtain an acceptable number of search results. Optimal Boolean searching requires the user to be familiar with Boolean operators and efficient combinations of terms. Nevertheless, users rarely use Boolean function words explicitly.
[0005]
Queries on probabilistic searches provide the user with an extensive search. Search results are generally compared to query terms using probabilistic algorithms and evaluated to determine how similar they match the query terms. Less frequent terms in the database are considered more discriminatory and are given more weight in predicting matches. Users are not bound by the many query terms they use on their own. This is because the evaluation of search results alleviates the problem of excessive search results.
[0006]
Nevertheless, the problem remains when querying electronic databases using probabilistic search methods. Due to misspellings in a query or database that are not spelled incorrectly, it is possible that relevant information may be overlooked during the search process. Similarly, the format of information that cannot be interpreted in a query or database can cause relevant information to be overlooked. If search speed is important, specifying many query terms results in an unsatisfactory slow query response. At the same time, if the user had to wait for the search results, the user might be frustrated. This is because the database is bound by other searches. Conversely, the user may be frustrated by the fact that a quick search results in poor results.
[0007]
Summary of the Invention
In one aspect, the invention includes a method of indexing a database. Database records are entered. Each record is dissected into record tokens using a pattern operation language. An index for the record is generated from the record token for each record.
[0008]
In one embodiment, dissecting comprises converting each record to original tokens, identifying each original token, and converting the identified original tokens to record tokens based on a pattern operation language. In a related embodiment, the pattern operation language is responsive to a domain associated with the record token.
[0009]
In another embodiment, generating the index comprises generating a list of unique index tokens from the record tokens for each record, calculating a frequency of occurrence in the database for each unique index token, and generating a table of index tokens. The index token table contains the frequency of occurrence in the database for each unique index token. In a related embodiment, each index token is a phonetic equivalent of the respective record token. In a further related embodiment, a list of unique record tokens is also generated.
[0010]
In another aspect, the invention includes a method of indexing a database. Records in the database are entered and each record is dissected into record tokens. An index token is generated from each record token. Each index token is the phonetic equivalent of a record token. The frequency of occurrence in the database is calculated for each unique index token. A table of index tokens is generated. The index token table contains the frequency of occurrence for each unique index token.
[0011]
In one embodiment, a list of unique record tokens is also generated. In a related embodiment, each record is dissected into record tokens using a pattern operation language. In another related embodiment, dissecting comprises converting each record to original tokens, identifying each original token, and converting the identified original tokens to record tokens based on a pattern operation language.
[0012]
In some embodiments of the foregoing, an index to the database is generated from the record tokens for each record. Each record token is associated with a domain in the database, the pattern operation language is responsive to the domain, the frequency of occurrence is calculated for each domain in the database, and the table of unique index tokens lists the frequency of occurrence by domain.
[0013]
In one aspect, the invention relates to an apparatus for indexing a database. The input device accepts records from the database. The dissection unit dissects the records into record tokens, and the index unit generates an index of the record tokens in the database.
[0014]
In one embodiment, the dissection unit includes a tokenization unit, a token specifying unit, and a token conversion unit. The tokenization unit converts the record into original tokens. The token specifying unit specifies each original token, and the token conversion unit converts the specified original tokens into record tokens based on the pattern operation language. In a related embodiment, the pattern operation language is responsive to the domain associated with the record token.
[0015]
In another embodiment, the index unit includes a token comparison unit, a frequency calculation unit, and a table generator. The token comparison unit generates a list of unique index tokens from the record tokens. The frequency calculation unit calculates the frequency of occurrence in the database for each unique index token. The table generator generates a table including the frequency of occurrence for each unique index token. In a related embodiment, the index token is a phonetic equivalent of each record token, and the token comparison unit communicates with the dissection unit via the token generator. In another embodiment, the record token comparison unit also generates a list of unique record tokens.
[0016]
In another aspect, the invention relates to an apparatus for indexing a database. The input device accepts records from the database and the dissection unit dissects the records into record tokens. The token generator generates an index token from each record token. The index token is the phonetic equivalent of each record token. The table generator generates a table that includes, for each index token, the frequency of occurrence of the index token in the database, as calculated by the frequency calculation unit, and pointers to all records that include the index token.
[0017]
In one embodiment, the record token comparison unit generates a list of unique record tokens from the record tokens for each record. In a related embodiment, the table generator generates a table that includes a pointer to each record in the database that includes an index token corresponding to a unique index token. In another related embodiment, the record token comparison unit, in communication with the dissection unit, also generates a list of unique record tokens.
[0018]
In a further related embodiment, the dissection unit dissects each record using a pattern operation language. In one such embodiment, the dissection unit further includes a tokenization unit, a token specifying unit, and a token conversion unit. The tokenization unit converts a record into original tokens. The token specifying unit specifies each original token, and the token conversion unit converts the specified original tokens into record tokens based on the pattern operation language.
[0019]
In one embodiment, the original token, each record token, and each respective index token are all associated with the same domain in the database, the pattern operation language is responsive to the domain associated with the tokens, and the frequency of occurrence of the index token is calculated by domain. In one embodiment, the table generator generates a table that includes a frequency of occurrence for each unique index token and a pointer to each record in the database that includes the corresponding record token.
[0020]
In one aspect, the invention relates to a method of querying a database. A query is entered and dissected into query tokens using a pattern operation language. Search tokens are generated from the query tokens. The search tokens are looked up in an index table to access records in the database.
[0021]
In one embodiment, dissecting comprises transforming the query into original tokens, identifying each original token, and transforming the identified original tokens into query tokens based on a pattern operation language. In a related embodiment, the original token and the resulting query token are associated with the same domain in the database. In a further related embodiment, the pattern operation language is responsive to a domain associated with the token.
[0022]
In a further related embodiment, the search token is derived from a query token. The search token is associated with the same domain in the database as the respective query token.
[0023]
In another aspect, the invention relates to a method of querying a database. A query is entered and dissected into query tokens. Search tokens are generated from the query tokens. Generating the search tokens includes checking the list of unique record tokens for tokens that are similar to a query token according to an information theory algorithm. It also includes translating the query tokens and the similar tokens into search tokens. Each search token is the phonetic equivalent of a query token or of a similar token. The search tokens are looked up in an index table to access records in the database.
[0024]
In one embodiment, the search tokens are associated with the same domain in the database as each query token. In a related embodiment, the dissection is performed using a pattern operation language. In a further related embodiment, dissecting comprises transforming the query into original tokens, identifying each original token, and converting the identified original tokens into query tokens based on a pattern operation language.
[0025]
In one aspect, the invention pertains to an apparatus for querying a database. The query input device accepts a query as input. The dissection unit dissects the input into query tokens using a pattern operation language. The generator generates search tokens from the query tokens. The database access unit accesses records in the database according to the search tokens.
[0026]
In one embodiment, the dissection unit includes a tokenization unit, a specification unit, and a conversion unit. The tokenization unit generates original tokens from the input, and the specification unit specifies each of them. The conversion unit converts the specified original tokens into query tokens based on the pattern operation language. In a related embodiment, the original token and the search token are associated with the same domain in the database as each query token. In another related embodiment, the tokens relate to the same domain in the database, and the pattern operation language is responsive to the domain associated with them.
[0027]
In another aspect, the invention relates to an apparatus for querying a database. The query input device accepts the input and the dissection unit dissects it into query tokens. The generator generates search tokens from the query tokens. The generator includes a query extension unit that adds tokens that qualify as similar to a query token according to an information theory algorithm. These are the similar tokens. The generator also includes a translator that translates each query token and each similar token into a search token that is its phonetic equivalent. The database access unit uses the search tokens to find pointers to the corresponding records in the database.
[0028]
In one embodiment, each query token, each similar token, and each search token are associated with the same domain in the database. In another embodiment, the dissection unit dissects using a pattern operation language. In a related embodiment, the dissection unit includes a tokenization unit, a specification unit, and a conversion unit.
[0029]
In one aspect, the invention pertains to a method for accessing data in a database. A token is selected from a set of tokens as the first token to be searched. A set of records is retrieved from the database according to the selected token. A likelihood of relevance to the query is determined for each record in the set. The set of records is ordered by the likelihood of relevance to the query. The highest likelihood of relevance in the set is compared to a continuation threshold. If the threshold is exceeded, the search is terminated and the set of records is output. If not, a different token is selected for a new search.
[0030]
In one embodiment, the likelihood of relevance for the query is determined based on record linkage theory. In a related embodiment, the set of records consists of more than one record, and the output records are ordered by the likelihood of the query's relevance.
[0031]
In another embodiment, the frequency of occurrence in the database is determined for each token, the tokens are ordered by frequency of occurrence, and the token with the lowest frequency of occurrence is selected as the first search token. If the continuation threshold is not exceeded, the token with the next lowest frequency of occurrence is selected as the next search token. In a related embodiment, the frequency of occurrence is for a domain in the database, and the tokens are each associated with a domain. In such embodiments, the tokens are ordered and the token with the lowest frequency of occurrence in the relevant domain is the first selected token. In a related embodiment, the likelihood of relevance to the query is determined for each record based on record linkage theory. In a further related embodiment, if the buffer of retrieved records overflows, the buffer is cleared and a new search is started for records containing all of the tokens.
[0032]
In another aspect, the invention relates to an apparatus for accessing data in a database. The database access unit retrieves a set of records from the database according to the token selected by the token selector as the first token to be searched. The relevance determiner determines the likelihood of query relevance for each record in the set. The relevance comparator orders the records in the set according to the relevance likelihood, and the threshold comparator compares the highest relevance likelihood with the continuation threshold. If the continuation threshold is exceeded, the threshold comparator ends the search. If not, the threshold comparator excludes the selected token and causes the token selector to select another token. The output device returns the set of records when the threshold comparator has ended the search.
[0033]
In one embodiment, the likelihood of relevance for the query is determined based on record linkage theory. In a related embodiment, the database access unit retrieves more than one record and the output device returns the records ordered by the likelihood of the query's relevance.
[0034]
In another embodiment, the frequency comparison unit determines the frequency of occurrence in the database for each token and orders the tokens according to frequency of occurrence. The token selector selects the token having the lowest occurrence frequency as the first token to be searched. In a related embodiment, the frequency comparison unit determines the frequency of occurrence in a domain in the database associated with the token, and selects the token having the lowest frequency of occurrence in the relevant domain as the first token. In another related embodiment, the relevance determiner determines the likelihood of query relevance based on record linkage theory. In a further related embodiment, the buffer overflow arrestor clears the buffer when it overflows and sends an overflow signal to the token selector. The database access unit then retrieves a set of records from the database containing all tokens.
[0035]
BEST MODE FOR CARRYING OUT THE INVENTION
In summary, the present invention relates to an information retrieval process for an electronic database, as shown in FIG. 1. In this process, a query is used to access existing query data, that is, the data held in the database. In the present invention, probabilities are used to select and retrieve records in the database according to a user's query. The information retrieval process can generally be separated into three steps, as shown in FIG. 1: indexing the query data, processing the query, and accessing the query data. The last two steps of the information retrieval process can be regarded as the search stage.
[0036]
Databases generally contain a large number of records, each of which can be queried by record number. Each record generally contains several domains. Similarly, each domain contains several fields. Each field may also contain free form text. For example, a national revenue service database may include separate records for each taxpayer. The taxpayer record may be numbered and may include separate domains for the taxpayer's home and work addresses. Each address domain may also include a street field, a town field, and a zip code field. The street field may accept free-form text such as, for example, "10910 Way Through The Woods", "71 Camino de Glasica". Databases generally do not require that all fields or domains contain information. For example, a taxpayer working as a freelance photographer may not have a business address, so the taxpayer's business address domain does not contain any data.
[0037]
Other database arrangements are possible, and the information retrieval process of the present invention can readily be applied to them. Nevertheless, for the purposes of the present application, the query data in the database is assumed to comprise a number of records, each record comprising a number of domains, each domain comprising one or more fields, and each field containing free-form text. Given this assumed arrangement of the query data, the present invention operates on the free-form text residing in the fields.
[0038]
In summary, as shown in the block diagram of FIG. 1, the first step in the information retrieval process is to index the query data (step 10). Indexing the query data can be thought of as preparing the query data for the search stage of the information retrieval process. FIG. 1A illustrates the evolution of a record during indexing, according to one embodiment of the present invention. When indexing begins, the elements 42 of each record 44 are dissected into a set of record tokens (T_Rn) 46. In some embodiments, the dissection process removes some parts of the record and normalizes the rest of the record. In the embodiment shown in FIG. 1A, the index tokens (T_In) 62 are generated from the record tokens (T_Rn) 46.
[0039]
When indexing is completed, the index tokens (T_In) 62 and the record tokens (T_Rn) 46 are analyzed to facilitate subsequent searches. In one embodiment, a list of unique record tokens (T_Rn) 46 is generated. In one embodiment, a table 96 of unique index tokens (T_In) 62 is generated. In a related embodiment, table 96 includes the frequency of occurrence (ν_n) in the database of each unique index token (T_In) 62. In another related embodiment, table 96 includes pointers 94 to the records in the query data that include the token. In the embodiment shown in FIG. 1A, there is one comprehensive table that contains all of the available index information. In other embodiments, there are multiple tables, each of which contains some of the available index information.
[0040]
The second step in the information retrieval process shown in FIG. 1 is processing the query (step 20). Processing the query can be regarded as preparing the query for use in the data access stage of the information retrieval process. FIG. 1B illustrates the evolution of query 54 during query processing, according to one embodiment of the present invention. The elements 52 of query 54 are dissected into a set of query tokens (T_Qn) 56. In the embodiment shown in FIG. 1B, the dissection process removes certain parts of the query 54 and normalizes the rest of the query. In one embodiment, any tokens from the list of record tokens (T_Rn) 46 that qualify as similar to a query token (T_Qn) 56 are added to the set of query tokens. In one embodiment, search tokens (T_Sn) 72, which can be used to access records in the query data, are generated from the query tokens (T_Qn) 56 and the similar tokens. In some embodiments, the processing of the query corresponds to the processing of the records in the query data.
[0041]
The third step in the information retrieval process shown in FIG. 1 is accessing the query data (step 30). Accessing the query data is the culmination of the preparation of the query data and of the query. FIG. 1C shows an access process according to an embodiment of the present invention. In one embodiment, a search token (T_Sn) 72 is selected from the set of search tokens based on the selectivity of the search token. Using the selected search token (T_Sn) 72 and the token table 96, records 44 are retrieved from the query data. In one embodiment, a weight is calculated for each record that represents the likelihood that the record is relevant to the user query 54. In a related embodiment, the weight calculation is based on record linkage theory. In one embodiment, the maximum weight for the set of retrieved records is compared to a threshold to determine whether to continue or terminate the search. In one embodiment, the retrieved records are returned to the user in ranked order. In some embodiments, the weight of each record is returned to the user, either alone or in association with the record. The information retrieval process ends with a list of records, and in some embodiments their weights, from which the user can evaluate the relevance of each record to the query.
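The access step described above can be sketched as a short loop. The following Python is only an illustration under stated assumptions: the token table is modeled as a dictionary from token to the set of record numbers containing it, the weight is a simple sum of log inverse frequencies used as a stand-in for a record linkage weight, and the names `record_weight`, `access_records`, and `continue_threshold` are hypothetical.

```python
import math

def record_weight(record_id, tokens, table, n_records):
    # Stand-in weight (an assumption, not the patent's formula): rarer
    # matching tokens contribute more, via log2(N / frequency).
    return sum(math.log2(n_records / len(table[t]))
               for t in tokens
               if t in table and record_id in table[t])

def access_records(search_tokens, table, n_records, continue_threshold):
    tokens = set(search_tokens)
    # Most selective (lowest frequency) token first; ties broken by name.
    remaining = sorted((t for t in tokens if t in table),
                       key=lambda t: (len(table[t]), t))
    weights = {}
    while remaining:
        token = remaining.pop(0)  # the selected token leaves the set
        for record_id in table[token]:
            if record_id not in weights:
                weights[record_id] = record_weight(
                    record_id, tokens, table, n_records)
        # Stop as soon as the best record exceeds the continuation threshold.
        if weights and max(weights.values()) > continue_threshold:
            break
    # Return records ordered by decreasing weight.
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))

table = {"smith": {1, 2, 3, 4}, "john": {1, 2, 3}, "zebedee": {2}}
for record_id, weight in access_records(
        ["smith", "john", "zebedee"], table, n_records=8,
        continue_threshold=5.0):
    print(record_id, round(weight, 3))
```

With a low threshold the loop stops after the rarest token; with a higher threshold it falls through to the more common tokens and returns a longer ranked list.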
[0042]
Refer to FIG. 2, which shows a detailed block diagram of a process for indexing the query data according to an embodiment of the present invention. In the first step, one record 44 of the query data is dissected (step 40). Dissecting a record involves separating the data in the record into a set of tokens. In one embodiment, the developer of the query data defines a set of characters to be used as the basis for separating the contents of the record into tokens. In some such embodiments, these developer-defined characters are used alone. In other such embodiments, these developer-defined characters are used together with default characters as the basis for separation. In other embodiments, the developer leaves the default characters to be used as the sole basis for separation. A group of characters may be used as a basis for separation. In some embodiments, the separator itself is a token.
[0043]
For example, a record containing "big;bad.wolf and red riding hood" becomes "<big><;><bad><.><wolf and red riding hood>", where the semicolon and the period are each defined as a separating character, and "<" and ">" indicate token boundaries. Similarly, a record containing "big;bad.wolf and red riding hood" becomes "<big;bad.wolf><and><red><riding><hood>", where a space is defined as a separating character and "<" and ">" again indicate token boundaries. In another embodiment, the separating characters are removed in the separation process. In some embodiments, different characters are used as a basis for separation in different fields or domains.
[0044]
In some embodiments, dissecting the record includes removing certain tokens. In one embodiment, the developer defines a set of tokens that are removed after the contents of the record have been separated into tokens. In one embodiment, the developer-defined tokens are the only tokens removed. In another embodiment, the developer-defined tokens are removed along with default tokens. In other embodiments, the developer simply allows the default tokens to be removed. A removed token need not consist of a single character. For example, "<big><;><bad><.><wolf and red riding hood>" becomes "<big><bad><wolf and red riding hood>", where the semicolon and the period are defined as tokens to be removed. In some embodiments, the developer defines different tokens to be removed in different fields or domains.
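The separation and removal steps described above can be illustrated with a minimal Python sketch. The function names `split_tokens` and `remove_tokens` and the `keep_separators` flag are hypothetical and chosen only for illustration; the record text is the example record "big;bad.wolf and red riding hood".

```python
def split_tokens(text, separators, keep_separators=True):
    """Split text at any separator character.

    When keep_separators is True, each separator becomes a token of its
    own; otherwise separators are dropped during separation.
    """
    tokens, current = [], ""
    for ch in text:
        if ch in separators:
            if current:
                tokens.append(current)
                current = ""
            if keep_separators:
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

def remove_tokens(tokens, unwanted):
    """Drop every token that appears in the set of tokens to be removed."""
    return [t for t in tokens if t not in unwanted]

record = "big;bad.wolf and red riding hood"
separated = split_tokens(record, ";.")
print(separated)                             # separators kept as tokens
print(remove_tokens(separated, {";", "."}))  # separators removed afterwards
print(split_tokens(record, " ", keep_separators=False))
```

Keeping the separators as tokens and removing them in a second pass mirrors the two-stage description; dropping them during separation corresponds to the embodiment in which separating characters are removed in the separation process.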
[0045]
In one embodiment, dissecting the record includes examining the set of tokens resulting from the separation process for patterns and acting on one or more tokens in a recognized pattern. In such an embodiment, the attributes of each token are determined once the record has been separated into tokens. In one embodiment, the attributes include type, length, value, abbreviation, and substring. In other embodiments, additional attributes are determined. In yet another embodiment, fewer attributes are determined. In any embodiment, determining certain attributes of a token may remove the need to determine other attributes of the token. In one embodiment using the type attribute, the types include numeric; alphabetic; leading numeric, meaning one or more digits followed by one or more letters; leading alphabetic, meaning one or more letters followed by one or more digits; complex mix, meaning a mixture of digits and letters that does not fall into either of the previous two categories; and special, meaning tokens that include special characters not commonly encountered. In another embodiment using a type attribute, other types are determined. In one embodiment, the alphabetic type is case sensitive. In some embodiments, additional developer-defined types are used in combination with the default types. In other embodiments, developer-defined types are used to the exclusion of the default types. For example, in one embodiment, the token <aBcdef> has the attributes of alphabetic type, a length of six characters, and a value of "aBcdef", where alphabetic tokens are case sensitive.
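A minimal Python sketch of attribute determination follows. The type names and the exact rule for each type are assumptions made for illustration (in particular, "special" is taken here to mean any token containing a character other than an ASCII letter or digit); only the type, length, and value attributes are modeled.

```python
import re

# Ordered (type name, regular expression) pairs; the first full match wins.
TYPE_RULES = [
    ("numeric", r"[0-9]+"),
    ("alphabetic", r"[A-Za-z]+"),
    ("leading_numeric", r"[0-9]+[A-Za-z]+"),
    ("leading_alpha", r"[A-Za-z]+[0-9]+"),
    ("complex_mix", r"[A-Za-z0-9]+"),
]

def token_type(token):
    for name, pattern in TYPE_RULES:
        if re.fullmatch(pattern, token):
            return name
    return "special"  # contains something other than ASCII letters/digits

def token_attributes(token):
    # The value is kept as-is, so alphabetic tokens stay case sensitive.
    return {"type": token_type(token), "length": len(token), "value": token}

print(token_attributes("aBcdef"))
for example in ["10910", "71A", "A71", "A7B", "a-b"]:
    print(example, token_type(example))
```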
[0046]
In embodiments in which dissection includes modifying a recognized pattern of tokens in some way, the patterns to be recognized must be defined in terms of the possible token attributes. In some such embodiments, patterns are defined to operate only on conditions that occur within a particular domain. In other such embodiments, patterns are defined to operate on conditions that occur anywhere within a set of domains. Pattern matching starts with the first token and proceeds one token at a time. Multiple pattern matches may be applied to one record. A pattern may be defined by any attribute of a token, a portion of a token, or a set of tokens. In one embodiment, the pattern is defined by attributes of a single token. In another embodiment, the pattern is defined by attributes of a set of tokens. For example, in one embodiment, a pattern is defined as a token with a length of less than ten whose first four characters are the substring "ANTI" and which contains the substring "CS", with no constraint on where "CS" occurs in the token. In that embodiment, the tokens <ANTICS> and <ANTI-CSAR> are recognized and acted upon. In contrast, the token <ANTIPATHY> is not recognized, because the constraint on the second substring is not satisfied. The token <ANTIPOLITICS> is also not recognized, because the length constraint is not satisfied.
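The single-token pattern in the example above can be expressed as a small predicate. The `Pattern` class and its field names are hypothetical; the sketch only checks the three constraints named in the text (length below ten, leading substring "ANTI", substring "CS" anywhere).

```python
from dataclasses import dataclass

@dataclass
class Pattern:
    max_length: int  # token length must be strictly less than this
    prefix: str      # required leading substring
    contains: str    # required substring, at any position

    def matches(self, token: str) -> bool:
        return (len(token) < self.max_length
                and token.startswith(self.prefix)
                and self.contains in token)

ANTI_CS = Pattern(max_length=10, prefix="ANTI", contains="CS")

for token in ["ANTICS", "ANTI-CSAR", "ANTIPATHY", "ANTIPOLITICS"]:
    print(token, ANTI_CS.matches(token))
```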
[0047]
In embodiments in which dissection includes modifying a recognized pattern of tokens in some way, a number of actions may be taken to modify the pattern. In one embodiment, the action taken in response to the recognized pattern is to change one of the attributes of the pattern. In another embodiment, the action taken in response to the recognized pattern is to concatenate portions of the pattern. In yet another embodiment, the action taken in response to the recognized pattern is to print debugging information. In other embodiments, other actions are performed. Some embodiments perform one action on one substring of one token. Some embodiments perform many actions in response to one recognized pattern. For example, in one embodiment, the command "SET the value of <token> to (1:2) <token>" is defined to execute on recognition of a pattern consisting of an alphabetic-type token whose length is 7 and whose first two characters are the substring "EX". In that embodiment, the token <EXAMPLE> is recognized as fitting the pattern, and executing the command changes the value of the token to the first two characters of the original token, that is, to "EX". In other embodiments, the values of words that are not normally useful for searching, for example noise words such as "at", "by", and "on", are set to zero so that they are removed from the list of unique index tokens. As shown in FIG. 1A, dissection converts database records 44 into record tokens (T_Rn) 46.
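The pairing of a recognized pattern with an action can be sketched as a list of (pattern, action) rules. This is only an illustration and does not reproduce the syntax of the pattern operation language: the function names are hypothetical, and a noise word's value is represented by `None` rather than zero before the token is dropped.

```python
NOISE_WORDS = {"at", "by", "on"}

def is_ex_pattern(token):
    # Alphabetic token of length 7 whose first two characters are "EX".
    return token.isalpha() and len(token) == 7 and token.startswith("EX")

def set_value_to_first_two(token):
    # Counterpart of "SET the value of <token> to (1:2) <token>".
    return token[:2]

def is_noise(token):
    return token.lower() in NOISE_WORDS

def null_value(token):
    return None  # a null value marks the token for removal

RULES = [(is_ex_pattern, set_value_to_first_two), (is_noise, null_value)]

def apply_rules(tokens, rules=RULES):
    result = []
    for token in tokens:
        value = token
        for pattern, action in rules:
            if pattern(token):
                value = action(token)
                break  # first matching rule wins in this sketch
        if value is not None:
            result.append(value)
    return result

print(apply_rules(["EXAMPLE", "at", "EXAMPLES", "wolf"]))
```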
[0048]
The second step in the process of indexing the query data shown in FIG. 2 is to determine the unique record tokens (step 50). Determining the unique record tokens produces a list of unique record tokens. Such a list may be described as a term dictionary of the database. In one embodiment, certain fields are excluded from contributing to the list. In another embodiment, certain domains are excluded from contributing to the list. In one embodiment, tokens are excluded from contributing to the list of unique tokens based on their type. In another embodiment, tokens are excluded from contributing to the list of unique tokens based on their type and another attribute. In some embodiments, excluded types or other attributes are specified for a domain. In some embodiments, excluded types or other attributes are specified for the record as a whole. For example, in one embodiment, the developer excludes all numeric tokens, and tokens with five or more characters, from the list of unique tokens. In another embodiment, step 50 is skipped. In yet another embodiment, step 50 occurs at a later point in the process of indexing the query data shown in FIG. 2.
[0049]
The third step in the process of indexing the query data shown in FIG. 2 is to convert the record tokens (T_Rn) 46 into index tokens (T_In) 62 (step 60). Step 60 is shown in FIG. 1A. In some embodiments, the index token is the record token itself. In such embodiments, step 70 duplicates step 50. In another embodiment, as shown in FIG. 1A, the index token (T_In) 62 is the phonetic equivalent of the record token (T_Rn) 46. In these embodiments, the index token is generally generated by translating the record token into a phonetic language. In one such embodiment, the phonetic language is NYSIIS. In another such embodiment, the phonetic language is SOUNDEX. In yet other such embodiments, the phonetic equivalent is based on another phonetic language or a variant thereof. In one embodiment, there are multiple sets of index tokens, each based on a different phonetic language or a variant thereof. In one embodiment, only record tokens of the alphabetic type are translated, and other types of record tokens are not used to generate index tokens. In another embodiment, both alphabetic and other types of record tokens generate index tokens, but only the alphabetic portion of a record token is translated into an index token.
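As an illustration of translating a record token into a phonetic equivalent, the following Python sketch implements the common four-character American Soundex code. It is one conventional variant and is not taken from the source; the helper `index_token`, which translates only alphabetic tokens and yields no index token for other types, is a hypothetical reading of one of the embodiments above.

```python
# Map each consonant to its Soundex digit; vowels, H, W, and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(token):
    letters = [c for c in token.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "HW":  # H and W do not separate equal codes
            previous = digit
    return (code + "000")[:4]  # pad or truncate to four characters

def index_token(record_token):
    # Only alphabetic record tokens are translated in this sketch.
    return soundex(record_token) if record_token.isalpha() else None

for name in ["Robert", "Rupert", "Smith", "Smyth", "10910"]:
    print(name, index_token(name))
```

Because differently spelled tokens such as "Smith" and "Smyth" map to the same code, a misspelled query token can still reach records indexed under the correctly spelled token.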
[0050]
The fourth step in the process of indexing the query data shown in FIG. 2 is to determine the unique index tokens (step 70). Step 70 closely resembles step 50. Determining the unique index tokens produces a list of unique index tokens. Such a list may be described as a dictionary of index terms. In one embodiment, certain fields are excluded from contributing to the list. In another embodiment, certain domains are excluded from contributing to the list. In one embodiment, tokens are excluded from contributing to the list of unique tokens based on their type. In another embodiment, tokens are excluded from contributing to the list of unique tokens based on their type and another attribute. In some embodiments, excluded types and other attributes are specified for a domain. In some embodiments, excluded types and other attributes are specified for the record as a whole. For example, in one embodiment, the developer removes all alphabetic tokens and tokens with five or more characters from the list of unique tokens. In another embodiment, step 70 is skipped. In yet another embodiment, step 70 is performed after step 80. In another embodiment, step 70 is performed as part of step 80.
[0051]
The fifth step in the process of indexing the query data shown in FIG. 2 is to check for additional records (step 80). This step is simply a check to determine when it is appropriate to calculate the frequency of occurrence of the index tokens. If there are additional records, the next record is processed before this step is repeated. If there are no additional records, the indexing process continues to step 90. In one embodiment, checking for additional records simply comprises looking for an end-of-file flag.
[0052]
The sixth and last step in the process of indexing the query data shown in FIG. 2 is to calculate the frequency of occurrence of tokens in the database (step 90). Frequency of occurrence is also known as collection frequency or document frequency. Assuming token independence, a lower frequency of occurrence indicates a more selective token. Tokens are not necessarily independent. For example, a phrase containing a particular group of tokens may appear repeatedly in the database. Nevertheless, the assumption of token independence can be tolerated as a reasonable approximation of reality. The frequency of occurrence may be calculated for any type of token that can be associated with a record. For example, in one embodiment, the frequency of occurrence is calculated for the index tokens. The frequency of occurrence may also be calculated for several different types of tokens that may be associated with a record. For example, in another embodiment, the frequency of occurrence is calculated for index tokens and record tokens.
[0053]
In one embodiment, a frequency of occurrence is calculated for each unique index token for the database as a whole. In another embodiment, a frequency of occurrence is calculated for each unique index token for each domain in the database. Other levels of granularity for the calculation are also possible. In some embodiments, a frequency of occurrence is not calculated for every unique index token. In one embodiment, the index tokens excluded from the calculation include noise words such as "the" and "and". Generating the list of index tokens while calculating the frequency of occurrence of each index token makes the frequency calculation more efficient.
[0054]
When the frequency of occurrence has been calculated, it is efficient to generate and store a token table 96 that contains, for each token, pointers 94 to the locations in the database of the records containing that token. This table 96 avoids the need for repeated searches for records containing a token. In one embodiment, as shown in FIG. 1A, the pointers 94 are included in a comprehensive table 96. In another embodiment, the pointers are included in a separate table and are associated with each token.
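By way of a non-limiting illustration, the frequency calculation and the token table described above might be sketched as follows. The names `build_token_table` and `NOISE_WORDS`, the use of a record's list position as its pointer, and the counting of frequency as the number of records containing a token are assumptions made for this sketch and are not taken from the specification.

```python
from collections import defaultdict

# Hypothetical noise-word list; the text names only "the" and "and" as examples.
NOISE_WORDS = {"the", "and"}

def build_token_table(records):
    """Map each unique index token to its frequency and record pointers.

    `records` is a list of token lists. A record's position in the list
    stands in for a pointer to that record (an assumption of this sketch).
    """
    pointers = defaultdict(set)
    for record_id, tokens in enumerate(records):
        for token in tokens:
            if token.lower() in NOISE_WORDS:
                continue  # no frequency is calculated for noise words
            pointers[token].add(record_id)
    # Frequency is counted here as the number of records containing the token.
    return {
        token: {"frequency": len(ids), "pointers": sorted(ids)}
        for token, ids in pointers.items()
    }

records = [
    ["JOHN", "SMITH", "and", "SONS"],
    ["MARY", "SMITH"],
    ["JOHN", "HUMPERDINK"],
]
table = build_token_table(records)
```

Because the pointers are stored with each token, a later lookup of a token such as `SMITH` returns its records directly, without a second scan of the database.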
[0055]
Please refer to FIG. 3. FIG. 3 shows a block diagram of an inquiry process according to one embodiment. The first step in processing the query shown in FIG. 3 is to dissect the query 40 (step 40). The query dissection can be performed using the same process used to dissect the records from the database (step 40) and variants thereof. The only difference is that dissecting a record 44 yields record tokens ($T_{Rn}$) 46, whereas dissecting the query 54 (see FIG. 1B) yields query tokens ($T_{Qn}$) 56.
[0056]
The second step in processing the query illustrated in FIG. 3 is to extend the query 40 (step 90). In some embodiments, the query is extended by adding similar tokens to the query tokens. In one such embodiment, similar tokens are selected from a list of unique record tokens. In selecting which tokens from the list of unique record tokens to add to the query tokens, various comparisons between the query token and the candidate record token are possible. Here, for ease of understanding, the list of unique record tokens may be considered to be a term dictionary of the database. Similarly, the various comparisons between the query token and the candidate record token may be considered spell checking of the query. In one embodiment, the following comparisons are possible: number of mismatched characters; number of transpositions; length of string. In another embodiment, a subset of the above comparisons is contemplated. In yet another embodiment, other comparisons are contemplated instead of or in addition to the specified comparisons.
[0057]
In one embodiment, the token of the heading entry from the list of unique record tokens is used for comparison with the query token. In another embodiment, a smaller subset of tokens from the list of unique record tokens is used for comparison. For example, in one such embodiment, only the subset of record tokens whose first two characters are the same as those of the query token is used for comparison with an independent reference token. In such an embodiment, if the list of unique record tokens does not include any record token whose first two characters are the same as those of the query token <XENTH>, no further comparison is performed and no record token is added to the set of reference tokens for the query token <XENTH>. In embodiments that extend the query by comparing candidate record tokens to query tokens, a threshold is set to determine which candidate record tokens are added to the set of reference tokens and which are not. In some embodiments, the threshold is based on the similarity of the candidate record token in comparison to the query token. In one such embodiment, the threshold is the minimum similarity required to include a candidate record token. In another embodiment, the threshold is based on the dissimilarity of the candidate record token in comparison to the query token. In one such embodiment, the threshold is the highest dissimilarity required to exclude candidate record tokens. In another embodiment, the threshold is a combination of similarity and dissimilarity.
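The restriction of candidates to record tokens sharing the first two characters of the query token might be sketched as follows. The function name `candidates_by_prefix` and its `prefix_len` parameter are hypothetical and introduced only for illustration.

```python
def candidates_by_prefix(query_token, unique_record_tokens, prefix_len=2):
    """Return the record tokens that share the query token's leading characters.

    Only these candidates go on to the more expensive similarity comparison;
    an empty result means nothing is added for this query token.
    """
    prefix = query_token[:prefix_len]
    return [t for t in unique_record_tokens if t[:prefix_len] == prefix]

dictionary = ["SMITH", "SMYTHE", "JONES", "SMALL"]
smyth_candidates = candidates_by_prefix("SMYTH", dictionary)
xenth_candidates = candidates_by_prefix("XENTH", dictionary)
```

In this sketch the query token `XENTH` yields no candidates, so no further comparison would be performed for it.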
[0058]
Various calculations of similarity and dissimilarity are possible, depending on the comparisons made between the query token and the candidate record token. Similarity may be calculated as follows, where each $S$ is a weighting factor, $c$ is the number of characters common to both the reference token and the candidate record token, $d$ is the length of the reference token, $r$ is the length of the candidate record token, and $t_r$ is the number of character transpositions found by comparing the query token against the candidate record token.
(1) $\mathrm{Similarity} = S_{cd}\,(c/d) + S_{rd}\,(c/r) + S_{tr}\,((c - t_r)/c)$
For the weighting factors $S$, $S_{cd}$ is the weighting factor for the percentage of characters in the query token that are common to the candidate record token, $S_{rd}$ is the weighting factor for the percentage of characters in the candidate record token that are common to the query token, and $S_{tr}$ is the weighting factor for the percentage of the characters common to the candidate record token and the query token that are not transposed. In one embodiment, all of the similarity weighting factors are set to the value 300, and candidate record tokens are added to the set of query tokens if their calculated similarity exceeds the minimum similarity.
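Equation (1) and the minimum-similarity test might be sketched as follows. The text does not say how common characters or transpositions are counted, so this sketch assumes that common characters are counted as a multiset intersection and that the transposition count `t_r` is supplied by the caller. The function names and the minimum similarity of 750 are hypothetical; only the weighting factor value of 300 comes from the text.

```python
from collections import Counter

def common_chars(a, b):
    """Count characters shared by a and b as a multiset intersection (assumption)."""
    return sum((Counter(a) & Counter(b)).values())

def similarity(query_token, candidate, t_r=0, s_cd=300, s_rd=300, s_tr=300):
    """Equation (1): weighted shares of common and untransposed characters."""
    c = common_chars(query_token, candidate)
    if c == 0:
        return 0.0  # nothing in common; also avoids dividing by zero
    d, r = len(query_token), len(candidate)
    return s_cd * (c / d) + s_rd * (c / r) + s_tr * ((c - t_r) / c)

def similar_tokens(query_token, candidates, min_similarity):
    """Keep the candidates whose similarity exceeds the minimum similarity."""
    return [t for t in candidates if similarity(query_token, t) > min_similarity]

added = similar_tokens("SMITH", ["SMITH", "SMYTH", "JONES"], min_similarity=750)
```

With all three weighting factors at 300, an identical token scores 900, which gives a natural scale for choosing the minimum similarity.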
[0059]
Dissimilarity may be calculated as follows, where each $D$ is a weighting factor, $u_{cd}$ is the number of characters in the reference token that are not in the candidate record token, $d$ is the length of the reference token, $u_{rd}$ is the number of characters in the candidate record token that are not in the reference token, $r$ is the length of the candidate record token, $t_r$ is the number of character transpositions found by comparing the query token to the candidate record token, and $c$ is the number of characters common to both the reference token and the candidate record token.
(2) $\mathrm{Dissimilarity} = D_{cd}\,(u_{cd}/d) + D_{rd}\,(u_{rd}/r) + D_{tr}\,(t_r/c)$
For the weighting factors $D$, $D_{cd}$ is the penalty factor for the percentage of characters in the reference token that are not in the candidate record token, $D_{rd}$ is the penalty factor for the percentage of characters in the candidate record token that are not in the reference token, and $D_{tr}$ is the penalty factor for the percentage of the characters common to the candidate record token and the query token that are transposed.
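Equation (2) might be sketched as follows. As assumptions of this sketch, common characters are counted as a multiset intersection, so that $u_{cd} = d - c$ and $u_{rd} = r - c$; the transposition count `t_r` is supplied by the caller; and tokens are non-empty. The text gives no values for the penalty factors, so the value of 100 used below is purely hypothetical.

```python
from collections import Counter

def dissimilarity(reference, candidate, t_r, d_cd, d_rd, d_tr):
    """Equation (2): weighted shares of unmatched and transposed characters."""
    c = sum((Counter(reference) & Counter(candidate)).values())
    d, r = len(reference), len(candidate)
    u_cd = d - c  # characters of the reference token not in the candidate
    u_rd = r - c  # characters of the candidate not in the reference token
    # The transposition term is undefined when nothing is common; treat it as 0.
    transposed = d_tr * (t_r / c) if c else 0.0
    return d_cd * (u_cd / d) + d_rd * (u_rd / r) + transposed

same = dissimilarity("SMITH", "SMITH", t_r=0, d_cd=100, d_rd=100, d_tr=100)
one_off = dissimilarity("SMITH", "SMYTH", t_r=0, d_cd=100, d_rd=100, d_tr=100)
swapped = dissimilarity("ABCD", "ABDC", t_r=1, d_cd=100, d_rd=100, d_tr=100)
```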
[0060]
In one embodiment, search tokens ($T_{Sn}$) 72 are generated from the query as extended to include both the query tokens ($T_{Qn}$) 56 and the similar tokens. The generation of the search tokens can be performed using the same process, and variants thereof, used to generate the index tokens from the record tokens (step 60). The only difference is that index tokens ($T_{In}$) 62 are generated from record tokens ($T_{Rn}$) 46, whereas search tokens ($T_{Sn}$) 72 are generated from query tokens ($T_{Qn}$) 56.
[0061]
In another embodiment, as shown in FIG. 1B, search tokens ($T_{Sn}$) 72 are generated from the query tokens ($T_{Qn}$) 56 alone. Again, the generation of the search tokens can be performed using the same process, and variants thereof, used to generate the index tokens from the record tokens (step 60). Again, the only difference is that index tokens ($T_{In}$) 62 are generated from record tokens ($T_{Rn}$) 46, whereas search tokens ($T_{Sn}$) 72 are generated from query tokens ($T_{Qn}$) 56.
[0062]
Please refer to FIG. 4. FIG. 4 is a block diagram of a process for accessing inquiry data according to one embodiment. The first step in accessing the inquiry data shown in FIG. 4 is to select a first search token (step 100). In one embodiment, the first search token is randomly selected from among the search tokens. In another embodiment, the first search token is selected according to a certain order among the search tokens. In some embodiments, the first search token is the most selective search token. In some embodiments, search tokens are ordered by selectivity. In one such embodiment, selectivity is determined by the frequency of occurrence within the indexed set of database records. In another such embodiment, selectivity is determined by the frequency of occurrence within a particular domain within the set of indexed database records. In another such embodiment, selectivity is determined by the frequency of occurrence within certain fields in the domain within the set of indexed database records. In one embodiment, the first search token is the most selective search token within the domain corresponding to the domain specified by the query. In another embodiment, the most selective search token is determined by comparing the frequencies of occurrence reported in the table of unique index tokens.
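Ordering search tokens by selectivity might be sketched as follows, taking a lower frequency of occurrence to mean a more selective token. The function name `order_by_selectivity` and the frequency values are hypothetical.

```python
def order_by_selectivity(search_tokens, frequency):
    """Sort search tokens from most selective (rarest) to least selective.

    A token missing from `frequency` is treated as having frequency 0 and
    sorts first; it retrieves no records, so trying it first costs little.
    """
    return sorted(search_tokens, key=lambda t: frequency.get(t, 0))

frequency = {"SMITH": 5000, "HUMPERDINK": 3, "JOHN": 800}
ordered = order_by_selectivity(["SMITH", "HUMPERDINK", "JOHN"], frequency)
first_search_token = ordered[0]
```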
[0063]
The second step in accessing the inquiry data shown in FIG. 4 is to access the inquiry data (step 110). In one embodiment, a new search of the database record set for the selected token is initiated. In another embodiment, once the first search token is selected, the selected token is looked up in a token table. In one such embodiment, as shown in FIG. 1C, the token table 96 returns a set of pointers 94 to the records in the database containing the selected token ($T_{S3}$) 72. In another such embodiment, the token table 96 indirectly returns a set of pointers 94 to the records in the database containing the selected token. The pointers may be used to access records in the database.
[0064]
The third step in accessing the inquiry data shown in FIG. 4 is to calculate relevance (step 120). In one embodiment, each accessed record is evaluated by calculating a weight that represents the likelihood of relevance to the query. In one such embodiment, the weight is calculated by comparing the query tokens against the record tokens. In another such embodiment, the weight is calculated by comparing the query tokens to the record tokens in the domain specified by the query.
[0065]
Record linkage is the process of examining records and locating pairs of records that match on a combination of certain fields. Record linkage theory provides the basis in probabilistic reasoning for considering a pair of records to match or to be relevant to each other. In some embodiments, the present invention applies this theory to match a query to individual records within a database record set. The query is defined as a record from set A. Records from the inquiry data that are candidates for matching the query are defined as records from set B. Each pair of records includes one record from set A, essentially a query, and one record from set B. Each pair of records is a member of either the set M of matched pairs or the set U of unmatched pairs.
[0066]
Under record linkage theory, the ability of a field to determine a match depends on the selectivity of the field contents and the accuracy of the field contents. Selectivity is a measure of the ability of the field contents to distinguish between records. For example, if the field is a surname field, the token <Humperdink> is much more selective than the token <Smith>. This is because there are likely to be more records containing <Smith> than <Humperdink> in the field. Selectivity $u_i$ is defined as the probability that two records have the same content in the field when the pair of records is a member of the set U of unmatched pairs. This is expressed mathematically as: $u_i = P(\mathrm{field\_match} \mid \rho \in U)$.
[0067]
Accuracy is a measure of the reliability of the data in the field. For example, field information that was more carefully entered or checked after entry is more likely to match in a matched pair than field information that was less carefully entered or not checked after entry. Accuracy $m_i$ is defined as the probability that two records have the same content in a field when the pair of records is a member of the set M of matched pairs. This is mathematically expressed using the probability $P(\alpha \mid \beta)$ that $\alpha$ is correct given a certain condition $\beta$: $m_i = P(\mathrm{field\_match} \mid \rho \in M)$.
[0068]
These measures are mathematically quantified and applied to predict the likelihood that records in the inquiry data will be of interest to the user based on the user's query. Consider a pair of records, where the first record is from set A and the second record is from set B. A and B share some common fields. Each record pair $\rho$ is an element of either the set M of matched pairs or the set U of unmatched pairs. For each record pair $\rho$ and each domain $i$ common to both sets of records, the following quantities are defined. The match weight $W_A$ is the base-2 logarithm of the ratio of the accuracy $m_i$ to the selectivity $u_i$:
(3) $W_A = \log_2 (m_i / u_i)$
In some embodiments, the match weight $W_A$ is added to the candidate record's likelihood of relevance when the candidate record contains a token equal to the query token in the respective domain $i$. In another embodiment, the match weight $W_A$ is added to the candidate record's likelihood of relevance when the candidate record contains a token equal to the query token in the respective field $i$. In other embodiments, $i$ represents another level that specifies the location of the data.
[0069]
The mismatch weight $W_D$ is the base-2 logarithm of the ratio of 1 minus the accuracy $m_i$ to 1 minus the selectivity $u_i$:
(4) $W_D = \log_2 ((1 - m_i) / (1 - u_i))$
In some embodiments, the mismatch weight $W_D$ is subtracted from the candidate record's likelihood of relevance when the candidate record does not include a token equal to the query token in the respective domain $i$. In another embodiment, the mismatch weight $W_D$ is subtracted from the candidate record's likelihood of relevance when the candidate record does not contain a token equal to the query token in the respective field $i$. In other embodiments, $i$ represents another level that specifies the location of the data.
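Equations (3) and (4) might be computed as in the following sketch. The function names and the example probabilities are hypothetical; the sketch assumes that $m_i$ and $u_i$ lie strictly between 0 and 1 so that both logarithms are defined.

```python
import math

def match_weight(m_i, u_i):
    """Equation (3): W_A = log2(m_i / u_i)."""
    return math.log2(m_i / u_i)

def mismatch_weight(m_i, u_i):
    """Equation (4): W_D = log2((1 - m_i) / (1 - u_i))."""
    return math.log2((1 - m_i) / (1 - u_i))

# Hypothetical field: reliable data (m_i = 0.9) and a rare token (u_i = 0.01).
w_a = match_weight(0.9, 0.01)
w_d = mismatch_weight(0.9, 0.01)
```

For a field whose accuracy exceeds its selectivity probability, as in this example, $W_A$ is positive and $W_D$ is negative; the rarer the token (the smaller $u_i$), the larger the match weight.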
[0070]
In some embodiments, an adjacency weight is added to the candidate record's likelihood of relevance if the candidate record includes more than one token equal to more than one query token and the matching tokens are immediately adjacent to each other. In some embodiments, a semi-adjacency weight is added to the candidate record's likelihood of relevance if the candidate record includes more than one token equal to more than one query token and the matching tokens are located close to each other. In one embodiment, the semi-adjacency weight is added if the matching tokens are separated by one intervening token. In other embodiments, the semi-adjacency weight is added if the matching tokens are separated by more than one intervening token. In one embodiment, the adjacency and semi-adjacency weights are weight factors of the matching search tokens. Various weighting schemes for proximity are possible. In one embodiment, the likelihood of relevance of a candidate record is the sum of the match weights $W_A$ for all record tokens in the candidate record that match a query token, plus the adjacency weights and the semi-adjacency weights. In one example, the semi-adjacency weight is added only when exactly one token lies between the record tokens in a candidate record that are equal to query tokens.
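One possible reading of this summation is sketched below: match weights are summed over the record tokens that equal a query token, an adjacency weight is added for each pair of matching tokens that are immediately adjacent, and a semi-adjacency weight is added for each pair separated by exactly one intervening token. The function name `relevance` and all weight values are hypothetical.

```python
def relevance(record_tokens, query_weights,
              adjacency_weight=1.0, semi_adjacency_weight=0.5):
    """Sum match weights plus proximity bonuses for one candidate record.

    `query_weights` maps each query token to its match weight W_A.
    """
    # Positions in the record whose token equals some query token.
    positions = [i for i, tok in enumerate(record_tokens) if tok in query_weights]
    total = sum(query_weights[record_tokens[i]] for i in positions)
    # Compare each matching position with the next matching position.
    for prev, nxt in zip(positions, positions[1:]):
        gap = nxt - prev
        if gap == 1:
            total += adjacency_weight       # immediately adjacent
        elif gap == 2:
            total += semi_adjacency_weight  # one token in between
    return total

weights = {"JOHN": 3.0, "SMITH": 5.0, "BOSTON": 4.0}
score = relevance(["JOHN", "Q", "SMITH", "BOSTON"], weights)
```

In this example `JOHN` and `SMITH` are separated by one token and `SMITH` and `BOSTON` are adjacent, so both bonuses are added to the summed match weights.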
[0071]
The fourth step in accessing the inquiry data shown in FIG. 4 is to compare the calculated relevance to a threshold (step 130). In some embodiments, the weight of each accessed record is compared to one or more thresholds. In another embodiment, the candidate records are ordered by their likelihood of relevance. As a result, the weights for the set of accessed records are compared more efficiently to the one or more thresholds.
[0072]
In some embodiments, the weight is compared to a continuation threshold. In such an embodiment, the search ends when the continuation threshold is exceeded. At that time, all the accessed records are output. In such embodiments, a different search is triggered if the continuation threshold is not exceeded (step 140). The token used as the basis for the previous search is removed from the set of available search tokens. The first step in a new search is to select a different token to access the query data. In such an embodiment, if the most selective token was already used to access data, the second most selective token is used for subsequent searches. The process is repeated until the continuation threshold is exceeded or all search tokens have been used to access the data.
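The continuation-threshold loop might be sketched as follows. As assumptions of this sketch, the token table maps each token to a list of record pointers, selectivity is taken from the length of that list, and `score` is a caller-supplied function returning a record's weight. All names and values are hypothetical.

```python
def search(search_tokens, token_table, score, continuation_threshold):
    """Access records token by token until the best weight is high enough."""
    available = set(search_tokens)
    accessed = {}  # record pointer -> weight
    while available:
        # Select the most selective remaining token (fewest records);
        # ties are broken alphabetically to keep the order deterministic.
        token = min(available, key=lambda t: (len(token_table.get(t, ())), t))
        available.remove(token)  # a used token is no longer available
        for record_id in token_table.get(token, ()):
            if record_id not in accessed:
                accessed[record_id] = score(record_id)
        if accessed and max(accessed.values()) > continuation_threshold:
            break  # continuation threshold exceeded: stop searching
    return accessed

token_table = {"HUMPERDINK": [2], "JOHN": [0, 2], "SMITH": [0, 1]}
weights = {0: 8.0, 1: 5.0, 2: 12.0}
tokens = ["JOHN", "SMITH", "HUMPERDINK"]
early = search(tokens, token_table, weights.get, continuation_threshold=10.0)
full = search(tokens, token_table, weights.get, continuation_threshold=20.0)
```

With a threshold of 10.0, the first and most selective token already retrieves a record weighing 12.0, so the search stops after one token; with a threshold of 20.0, every search token is used.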
[0073]
In some embodiments, the weight of each accessed record is compared to a presentation threshold. In such an embodiment, only some of the accessed records are output. In embodiments that use a presentation threshold, the output records are limited to those records whose likelihood of relevance exceeds the presentation threshold.
[0074]
In some embodiments, the highest possible likelihood of relevance is calculated for each query. The highest possible likelihood of relevance depends on the weighting scheme chosen. In some embodiments, the developer selects additional tokens to reduce the weight of the candidate record. For example, in an embodiment that uses only the match weight $W_A$, the highest possible likelihood of relevance is the weight that a candidate record would have if it contained every query token in its respective domain. As a further example, in an embodiment that uses the match weight $W_A$ and an adjacency weight, the highest possible likelihood of relevance is the weight that a candidate record would have if it contained every query token in its respective domain and in the arrangement of the query.
[0075]
In one embodiment, the continuation threshold weight used as the basis for terminating the search is a percentage of the highest possible likelihood of relevance. In another embodiment, the continuation threshold weight is an absolute weight. In one embodiment, the presentation threshold weight used as the criterion for presenting the records accessed in the search is a percentage of the highest possible likelihood of relevance. In another embodiment, the presentation threshold weight is an absolute weight.
[0076]
In one embodiment, the accessed records are output in order of their likelihood of relevance. In other embodiments, the accessed records are output in the order in which they were retrieved. In still other embodiments, the accessed records are output in another order.
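Percentage-based thresholds and ordered output might be sketched as follows, for an embodiment that uses only match weights, so that the highest possible weight is the sum of the match weights of all query tokens. The function names and the percentages of 80% and 50% are hypothetical.

```python
def thresholds(query_weights, continuation_pct=0.8, presentation_pct=0.5):
    """Derive both thresholds as fractions of the highest possible weight."""
    highest = sum(query_weights.values())  # record matches every query token
    return continuation_pct * highest, presentation_pct * highest

def present(accessed, presentation_threshold):
    """Return pointers of records above the threshold, best weight first."""
    kept = {rid: w for rid, w in accessed.items() if w > presentation_threshold}
    return sorted(kept, key=kept.get, reverse=True)

continuation, presentation = thresholds({"JOHN": 4.0, "SMITH": 6.0})
output = present({0: 9.0, 1: 5.0, 2: 6.5}, presentation)
```

Expressing the thresholds as percentages keeps them comparable across queries with different numbers of tokens, whereas an absolute weight does not scale with query length.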
[0077]
Some embodiments include access processing steps in the database that are not shown in the embodiment of FIG. The amount of information accessed in this step is compared to an overflow threshold. In such an embodiment, if the overflow threshold is exceeded, the current search ends and the memory or buffer is cleared. In one such embodiment, a new search is triggered. The new search is based on all search tokens joined together with a Boolean AND. If the overflow threshold triggered the new search, the continuation threshold is then disabled. Otherwise, the records accessed in the new search are processed in the same way as in a normal search. In one embodiment, the overflow threshold used as the basis for terminating the search and triggering a different search is similar to a software error or a warning about available memory or buffer space.
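The overflow fallback might be sketched as follows. As assumptions of this sketch, the amount of information accessed is measured as a number of record pointers, and the Boolean AND is implemented as an intersection of the pointer sets of all search tokens. The function name and the returned flag are hypothetical.

```python
def access_with_overflow(search_tokens, token_table, first_token, overflow_threshold):
    """Access records for one token, falling back to a Boolean AND on overflow.

    Returns (record pointers, overflowed). When `overflowed` is True the
    caller would disable the continuation threshold for the new search.
    """
    pointers = set(token_table.get(first_token, ()))
    if len(pointers) <= overflow_threshold:
        return pointers, False
    # Overflow: discard the buffered pointers and AND all search tokens.
    result = None
    for token in search_tokens:
        ids = set(token_table.get(token, ()))
        result = ids if result is None else result & ids
    return (result or set()), True

token_table = {"SMITH": [0, 1, 2, 3], "JOHN": [1, 3, 5]}
narrowed = access_with_overflow(["SMITH", "JOHN"], token_table, "SMITH", 3)
normal = access_with_overflow(["SMITH", "JOHN"], token_table, "SMITH", 10)
```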
[0078]
Finally, in one embodiment, the developer chooses to perform, for each query and in addition to the customary search, a new search based on all search tokens joined with a Boolean AND.
[0079]
Having described embodiments of the invention, it will be apparent to those skilled in the art that other embodiments incorporating the concepts disclosed herein may be practiced without departing from the spirit and scope of the invention. The embodiments described herein are, in all respects, to be considered illustrative and not restrictive. Therefore, the scope of the present invention should be construed as limited only by the appended claims.
[Brief description of the drawings]
FIG. 1
FIG. 1 is a functional block diagram of an information search process known in the prior art.
FIG. 1A is a diagram of the progress of a record during index processing according to an embodiment of the present invention.
FIG. 1B is a diagram of the progress of a query during query processing according to an embodiment of the present invention.
FIG. 1C is a diagram showing the interaction of a search token and record lookup in the information access processing according to an embodiment of the present invention.
FIG. 2
FIG. 2 is a functional block diagram of the information indexing part of the information search processing according to an embodiment of the present invention.
FIG. 3
FIG. 3 is a functional block diagram of the inquiry processing part of the information search processing according to an embodiment of the present invention.
FIG. 4
FIG. 4 is a functional block diagram of the information access part of the information search processing according to an embodiment of the present invention.

Claims

1. A method of indexing a database, the method comprising:
(a) inputting a record in a database;
(b) dissecting each record into a plurality of record tokens using a pattern operation language; and
(c) generating an index for the record from the plurality of record tokens for each record.

2. The method of claim 1, wherein step (b) of dissecting each record into a plurality of record tokens using a pattern operation language comprises:
(i) converting each record into a plurality of original tokens;
(ii) identifying each original token; and
(iii) converting the plurality of identified original tokens into the plurality of record tokens based on the pattern operation language.

3. The method of claim 2, wherein the pattern operation language is responsive to a domain associated with each of the plurality of record tokens.

4. The method of claim 1, wherein step (c) of generating an index for the record from the plurality of record tokens for each record comprises:
(i) generating a list of unique index tokens from the plurality of record tokens for each record;
(ii) calculating the frequency of occurrence in the database for each unique index token; and
(iii) generating a table of index tokens including the frequency of occurrence in the database for each unique index token.

5. The method of claim 4, further comprising, prior to step (c), generating, from each record token, an index token containing a phonetic equivalent for the respective record token.

6. The method of claim 5, further comprising generating a list of unique record tokens.

7. A method of indexing a database, the method comprising:
(a) inputting a record in a database;
(b) dissecting each record into a plurality of record tokens;
(c) generating, from each record token, an index token, wherein the index token includes a phonetic equivalent for the respective record token;
(d) calculating the frequency of occurrence in the database for each unique index token; and
(e) generating a table of index tokens containing the frequency of occurrence for each unique index token.

8. The method of claim 7, further comprising generating a list of unique record tokens.

9. The method of claim 8, wherein step (b) comprises dissecting each record into a plurality of record tokens using a pattern operation language.

10. The method of claim 9, wherein step (b) of dissecting each record into a plurality of record tokens using a pattern operation language comprises:
(i) converting each record into a plurality of original tokens;
(ii) identifying each original token; and
(iii) converting the plurality of identified original tokens into the plurality of record tokens based on the pattern operation language.

11. A method of indexing a database, the method comprising:
(a) inputting a record in a database;
(b) dissecting each record into a plurality of record tokens, wherein each of the plurality of record tokens is associated with a respective domain in the database, and wherein the dissection step comprises:
(i) converting each record into a plurality of original tokens;
(ii) identifying each original token; and
(iii) converting the plurality of identified original tokens into the plurality of record tokens based on a pattern operation language, wherein the pattern operation language is responsive to the respective domain associated with each of the plurality of record tokens;
(c) generating a list of unique index tokens;
(d) generating an index token from each record token, wherein the index token is associated with the respective domain in the database with which the respective record token is associated, and wherein the index token includes a phonetic equivalent for the respective record token;
(e) generating a list of unique index tokens;
(f) calculating the frequency of occurrence in each domain of the database for each unique index token;
(g) generating a table of index tokens containing the frequency of occurrence in each domain of the database for each unique index token; and
(h) generating an index for the record from the plurality of record tokens for each record.

12. An apparatus for indexing a database, the apparatus comprising:
an input device that accepts records from the database;
a dissection unit, in signal communication with the input device, that dissects each record of the database into a plurality of record tokens using a pattern operation language; and
an index unit, in signal communication with the dissection unit, that generates an index for the plurality of record tokens in the database.

13. The apparatus of claim 12, wherein the dissection unit comprises:
a tokenization unit, in signal communication with the input device, that converts each record into a plurality of original tokens;
a token identification unit, in signal communication with the tokenization unit, that identifies each original token; and
a token conversion unit, in signal communication with the token identification unit, that converts the plurality of identified original tokens into the plurality of record tokens based on the pattern operation language.

14. The apparatus of claim 13, wherein the pattern operation language used by the token conversion unit to convert the plurality of identified original tokens into the plurality of record tokens is responsive to a domain associated with each of the plurality of record tokens.

15. The apparatus of claim 12, wherein the index unit comprises:
a token comparison unit, in signal communication with the dissection unit, that generates a list of unique index tokens from the plurality of record tokens for each record;
a frequency calculation unit, in signal communication with the token comparison unit, that calculates a frequency of occurrence in the database for each unique index token on the list of unique index tokens; and
a table generation unit, in signal communication with the frequency calculation unit, that generates a table including the frequency of occurrence calculated by the frequency calculation unit for each unique index token.

16. The apparatus of claim 15, further comprising:
a token generation unit, in signal communication with the dissection unit, that generates at least one index token for each record token, wherein the at least one index token includes a phonetic equivalent for the respective record token,
wherein the token comparison unit is in signal communication with the dissection unit via the token generation unit.

17. The apparatus of claim 16, further comprising a record token comparison unit, in signal communication with the dissection unit, that generates, for each record, a list of unique record tokens from the plurality of record tokens.

18. An apparatus for indexing a database, the apparatus comprising:
an input device that accepts records from the database;
a dissection unit, in signal communication with the input device, that dissects each record of the database into a plurality of record tokens;
a token generation unit, in signal communication with the dissection unit, that generates at least one index token for each record token, wherein the at least one index token includes a phonetic equivalent for the respective record token;
a frequency calculation unit, in signal communication with the token generation unit, that calculates a frequency of occurrence in the database for each unique index token; and
a table generation unit, in signal communication with the frequency calculation unit, that generates a table including, for each unique index token, the frequency of occurrence calculated by the frequency calculation unit and a pointer to each record in the database that includes an index token corresponding to the unique index token.

19. The apparatus of claim 18, further comprising a token comparison unit, in signal communication with the dissection unit, that generates a list of unique index tokens from the plurality of record tokens for each record.

20. The apparatus of claim 18, wherein the dissection unit dissects each record in the database into a plurality of record tokens using a pattern operation language.

21. The apparatus of claim 20, wherein the dissection unit further comprises:
a tokenization unit, in signal communication with the input device, that converts each record into a plurality of original tokens;
a token identification unit, in signal communication with the tokenization unit, that identifies each original token; and
a token conversion unit, in signal communication with the token identification unit, that converts the plurality of identified original tokens into the plurality of record tokens based on the pattern operation language.

22. An apparatus for indexing a database, the apparatus comprising:
an input device that accepts records from the database;
a dissection unit, in signal communication with the input device, that dissects each record of the database into a plurality of record tokens using a pattern operation language, the dissection unit further comprising:
a tokenization unit, in signal communication with the input device, that converts each record into a plurality of original tokens;
a token identification unit, in signal communication with the tokenization unit, that identifies each original token; and
a token conversion unit, in signal communication with the token identification unit, that converts the plurality of identified original tokens into the plurality of record tokens based on the pattern operation language, wherein the pattern operation language is responsive to the respective domain associated with each of the plurality of record tokens;
a token comparison unit, in signal communication with the dissection unit, that generates a list of unique index tokens from the plurality of record tokens for each record;
a token generation unit, in signal communication with the dissection unit, that generates at least one index token for each record token, wherein the at least one index token includes a phonetic equivalent for the respective record token;
a record token comparison unit, in signal communication with the dissection unit, that generates, for each record, a list of unique record tokens from the at least one index token;
a frequency calculation unit, in signal communication with the token generation unit, that calculates a frequency of occurrence in the database for each unique index token; and
a table generation unit, in signal communication with the frequency calculation unit, that generates a table including, for each unique index token, the frequency of occurrence calculated by the frequency calculation unit and a pointer to each record in the database that includes a record token corresponding to the unique index token.

23. A method of querying a database, the method comprising:
(a) inputting a query;
(b) dissecting the query into a plurality of query tokens using a pattern operation language;
(c) generating at least one search token from each query token; and
(d) looking up at least one search token in an index table to access at least one record in the database.

24. The method of claim 23, wherein step (b) of dissecting the query into a plurality of query tokens using a pattern operation language comprises:
(i) converting the query into a plurality of original tokens;
(ii) identifying each original token; and
(iii) converting the plurality of identified original tokens into the plurality of query tokens based on the pattern operation language.

25. The method of claim 24, wherein:
each of the plurality of query tokens in step (b) of dissecting the query into a plurality of query tokens using a pattern operation language is associated with a respective domain in the database; and
the dissection step (b) comprises:
(i) converting the query into a plurality of original tokens;
(ii) identifying each original token; and
(iii) converting the plurality of identified original tokens into the plurality of query tokens based on the pattern operation language, wherein the pattern operation language is responsive to the respective domain associated with each of the plurality of query tokens.

26. The method of claim 24, wherein:
each of the plurality of query tokens in step (b) of dissecting the query into a plurality of query tokens using a pattern operation language is associated with a respective domain in the database;
the dissection step (b) comprises:
(i) converting the query into a plurality of original tokens;
(ii) identifying each original token; and
(iii) converting the plurality of identified original tokens into the plurality of query tokens based on the pattern operation language; and
the at least one search token in step (c) of generating at least one search token from each query token is associated with the domain in the database with which the respective query token is associated.

27. A method of querying a database, the method comprising:
(a) inputting a query;
(b) dissecting the query into a plurality of query tokens;
(c) generating at least one search token from each query token, the generating step comprising:
(i) checking a list of unique record tokens in the database for at least one similar token, wherein the at least one similar token qualifies as similar to the respective query token based on an information theory algorithm; and
(ii) translating each respective query token and any similar tokens into the at least one search token, wherein the at least one search token includes a phonetic equivalent for the respective query token or similar token; and
(d) looking up the at least one search token in an index table to access at least one record in the database.
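Step (c)(i) can be sketched as follows. The claim does not name the information theory algorithm, so this example substitutes one possible measure as an assumption: each character bigram is weighted by its self-information, -log2 p, estimated from the list of unique record tokens, and two tokens qualify as similar when the shared information is a large enough share of the total. The function names and the 0.4 threshold are hypothetical.

```python
import math
from collections import Counter


def bigrams(token):
    """Character bigrams of a token, padded with start and end markers."""
    padded = f"^{token.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def bigram_information(dictionary):
    """Self-information (in bits) of every bigram seen in the dictionary."""
    counts = Counter(bg for tok in dictionary for bg in bigrams(tok))
    total = sum(counts.values())
    return {bg: -math.log2(n / total) for bg, n in counts.items()}


def similarity(a, b, info):
    """Shared bigram information divided by total bigram information (0..1)."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    default = max(info.values(), default=1.0)   # unseen bigrams count as rare

    def weight(counter):
        return sum(info.get(bg, default) * n for bg, n in counter.items())

    total = weight(ca | cb)
    return weight(ca & cb) / total if total else 0.0


def expand(query_token, dictionary, threshold=0.4):
    """Return dictionary tokens that qualify as similar to the query token."""
    info = bigram_information(dictionary)
    return [tok for tok in dictionary
            if tok != query_token
            and similarity(query_token, tok, info) >= threshold]
```

Each query token and each similar token returned by `expand` would then be translated into a search token in step (c)(ii), for example by a phonetic coding.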

28. The method of claim 27, wherein:
each of the plurality of query tokens in step (b) of dissecting the query into a plurality of query tokens is associated with a respective domain in the database;
each of the at least one search token in step (c) of generating at least one search token from each query token is associated with the domain in the database with which the respective query token is associated; and
step (c) is a generating step comprising:
(i) checking a list of unique record tokens in the database for at least one similar token, wherein the at least one similar token qualifies as similar to the respective query token based on an information theory algorithm; and
(ii) translating each respective query token and any similar tokens into the at least one search token, wherein the at least one search token includes a phonetic equivalent for the respective query token or similar token.

29. The method of claim 28, wherein step (b) is a step of dissecting the query into the plurality of query tokens using a pattern operation language, each of the plurality of query tokens being associated with a respective domain in the database.

30. The method of claim 29, wherein each of the plurality of query tokens in step (b) of dissecting the query into a plurality of query tokens using a pattern operation language is associated with a respective domain in the database, and the dissection step (b) comprises:
(i) converting the query into a plurality of original tokens;
(ii) identifying each original token; and
(iii) converting the plurality of identified original tokens into the plurality of query tokens based on the pattern operation language.

31. An apparatus for querying a database, the apparatus comprising:
A query input device;
A dissection unit in signal communication with the query input device to dissect the input to the query input device into a plurality of query tokens using a pattern operation language;
A generator in signal communication with the dissection unit to generate at least one search token for each query token; and
A database access unit that is in signal communication with the database and the generator and accesses a record in the database according to at least one of the search tokens generated by the generator.

32. The apparatus of claim 31, wherein the dissection unit comprises:
A tokenizer that is in signal communication with the query input device and generates a plurality of original tokens from the input to the query input device;
A token identification unit that is in signal communication with the tokenizer and identifies each of the original tokens generated by the tokenizer; and
A token conversion unit that is in signal communication with the token identification unit and converts the plurality of identified original tokens into the plurality of query tokens based on the pattern operation language.

33. The apparatus of claim 32, wherein:
each of the plurality of original tokens generated by the tokenizer from the input to the query input device is associated with a domain in the database; and
in the token conversion unit that converts the plurality of identified original tokens into the plurality of query tokens based on the pattern operation language, each of the plurality of query tokens is associated with the domain in the database associated with the respective original token, and the pattern operation language is responsive to the respective domain associated with each of the plurality of query tokens.

34. The apparatus of claim 32, wherein:
each of the plurality of original tokens generated by the tokenizer from the input to the query input device is associated with a domain in the database;
each of the plurality of query tokens in the token conversion unit that converts the plurality of identified original tokens into the plurality of query tokens based on the pattern operation language is associated with the domain in the database associated with the respective original token; and
the search token in the generator that generates at least one search token for each query token is associated with the domain in the database associated with the respective query token.

35. An apparatus for querying a database, the apparatus comprising:
A query input device;
A dissection unit that is in signal communication with the query input device and dissects the input to the query input device into a plurality of query tokens;
A generator in signal communication with the dissection unit to generate at least one search token for each query token, the generator comprising:
A query extension unit that is in signal communication with the dissection unit and adds a similar token that is similar to at least one of the plurality of query tokens based on an information theory algorithm; and
A translation unit that is in signal communication with the query extension unit and translates each query token and each similar token output by the query extension unit into a respective search token, wherein each search token includes a phonetic equivalent for the respective query token or similar token; and
A database access unit that is in signal communication with the database and the generator and accesses a record in the database in response to at least one respective search token generated by the generator.

36. The apparatus of claim 35, wherein:
each of the plurality of query tokens in the dissection unit that dissects the input to the query input device into a plurality of query tokens is associated with a domain in the database;
each similar token in the query extension unit that adds a similar token similar to at least one of the plurality of query tokens based on an information theory algorithm is associated with the domain in the database associated with the at least one of the plurality of query tokens; and
each respective search token in the translation unit that translates each query token and each similar token output by the query extension unit into a respective search token is associated with the domain in the database associated with the respective query token.

37. The apparatus of claim 36, wherein the dissection unit dissects the input to the query input device into the plurality of query tokens using a pattern operation language.

38. The apparatus of claim 37, wherein the dissection unit further comprises:
A tokenizer in signal communication with the query input device for converting each query into a plurality of original tokens, wherein each of the plurality of original tokens is associated with a respective domain in the database;
A token identification unit that is in signal communication with the tokenizer and identifies each original token; and
A token conversion unit that is in signal communication with the token identification unit and converts the plurality of identified original tokens into the plurality of query tokens based on the pattern operation language, wherein each of the plurality of query tokens is associated with the respective domain associated with the original token.

39. A method for accessing data in a database, the method comprising:
(a) selecting one token from a plurality of tokens as a first token to be searched;
(b) retrieving at least one record from the database in response to the selected token;
(c) determining, for each of the at least one record, a likelihood of relevance to a query;
(d) ordering the at least one record according to the likelihood of relevance to the query;
(e) comparing a continuation threshold with the highest likelihood of relevance to the query among the at least one record, and
(i) terminating the search if the highest likelihood of relevance to the query exceeds the continuation threshold, or
(ii) if the continuation threshold exceeds the highest likelihood of relevance to the query, selecting a different one of the plurality of tokens as the next token to be searched and repeating steps (b) through (e); and
(f) returning the at least one retrieved record.
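The loop formed by steps (a) through (f) can be sketched as below. This is an illustration under assumptions rather than the claimed implementation: `index` is assumed to map a token to a list of record identifiers, `score` is a caller-supplied function returning a record's likelihood of relevance to the query, and tokens are tried in the order given. All names are hypothetical.

```python
def search(tokens, index, score, continuation_threshold):
    """Retrieve records token by token until the best score clears the threshold.

    Returns (record_id, likelihood) pairs ordered from most to least relevant.
    """
    remaining = list(tokens)
    retrieved = {}                                  # record id -> likelihood
    while remaining:
        token = remaining.pop(0)                    # (a) select a token
        for rec_id in index.get(token, []):         # (b) retrieve its records
            if rec_id not in retrieved:
                retrieved[rec_id] = score(rec_id)   # (c) score each record
        best = max(retrieved.values(), default=None)
        # (e)(i) stop once the highest likelihood exceeds the threshold;
        # (e)(ii) otherwise the loop moves on to the next remaining token.
        if best is not None and best > continuation_threshold:
            break
    # (d) and (f): order the retrieved records by likelihood and return them
    return sorted(retrieved.items(), key=lambda item: item[1], reverse=True)
```

Because the selected token is removed from `remaining` on each pass, the loop always terminates, either at the threshold or when no tokens are left.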

40. The method of claim 39, wherein step (c) determines the likelihood of relevance to the query for each of the at least one record based on record linkage theory.
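The claim refers only to record linkage theory. As an illustration, the sketch below assumes the familiar Fellegi-Sunter formulation, in which each compared field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m and u are the probabilities of agreement among true matches and among non-matches. The function names and probability values are hypothetical and do not come from the source.

```python
import math


def field_weight(agrees, m, u):
    """Log-likelihood-ratio weight (in bits) for one compared field."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - u))


def relevance_score(comparison, probabilities):
    """Sum the field weights for one record compared against the query.

    comparison:    field name -> True if the record agrees with the query
    probabilities: field name -> (m, u) pair for that field
    """
    return sum(field_weight(agrees, *probabilities[field])
               for field, agrees in comparison.items())
```

A higher total indicates that the record is more likely to be relevant to the query; a score of this kind could then be compared with the continuation threshold in step (e).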

41. The method of claim 40, wherein:
step (b) retrieves a plurality of records from the database according to the selected token;
step (c) determines the likelihood of relevance to the query for each of the plurality of records based on record linkage theory;
step (d) orders the plurality of records by the likelihood of relevance to the query;
step (e) compares the continuation threshold with the highest likelihood of relevance to the query among the plurality of records, and
(i) terminates the search if the highest likelihood of relevance to the query exceeds the continuation threshold, or
(ii) if the continuation threshold exceeds the highest likelihood of relevance to the query, selects a different one of the plurality of tokens as the next token to be searched and repeats steps (b) through (e); and
step (f) returns a plurality of retrieved records, the plurality of retrieved records being ordered by the likelihood of relevance to the query.

42. The method of claim 39, further comprising:
prior to step (a), determining a frequency of occurrence in the database for each of the plurality of tokens; and
ordering the tokens according to the frequency of occurrence,
wherein, in step (a) of selecting one token from the plurality of tokens as a first token to be searched, the selected token has the lowest frequency of occurrence, and
step (e) of comparing a continuation threshold with the highest likelihood of relevance to the query among the at least one record comprises:
(i) terminating the search if the highest likelihood of relevance to the query exceeds the continuation threshold; or
(ii) if the continuation threshold exceeds the highest likelihood of relevance to the query, selecting a different one of the plurality of tokens as the next token to be searched, the different token having the lowest frequency of occurrence among the tokens not yet searched, and repeating steps (b) through (e).

43. The method of claim 42, wherein the frequency of occurrence in the database is a frequency of occurrence within each domain, the method further comprising:
prior to step (a), determining the frequency of occurrence in each domain of the database for each of the plurality of tokens, wherein each of the plurality of tokens is associated with a respective domain in the database; and
ordering the tokens by the frequency of occurrence within each domain of the database,
wherein, in step (a) of selecting one token from the plurality of tokens as a first token to be searched, the selected token has the lowest frequency of occurrence in its respective domain, and
step (e) of comparing a continuation threshold with the highest likelihood of relevance to the query among the at least one record comprises:
(i) terminating the search if the highest likelihood of relevance to the query exceeds the continuation threshold; or
(ii) if the continuation threshold exceeds the highest likelihood of relevance to the query, selecting a different one of the plurality of tokens as the next token to be searched, the different token having the lowest frequency of occurrence in its respective domain among the tokens not yet searched, and repeating steps (b) through (e).
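The rarest-first selection described in claims 42 and 43 can be sketched as a sort on per-domain frequency. The data layout is an assumption: each token is a `(domain, value)` pair and the frequency table is keyed by the same pairs. Tokens absent from the table are treated as having frequency 0, which is also an assumption. The domain names in the usage example are hypothetical.

```python
def order_by_rarity(tokens, frequency):
    """Order (domain, value) tokens from least to most frequent in their domain."""
    return sorted(tokens, key=lambda token: frequency.get(token, 0))


# Hypothetical usage: the same value can be rare in one domain and common in another.
frequency = {
    ("surname", "smith"): 900,
    ("city", "smith"): 4,
    ("given_name", "john"): 1500,
}
query_tokens = [("surname", "smith"), ("given_name", "john"), ("city", "smith")]
search_order = order_by_rarity(query_tokens, frequency)
```

Searching the rarest token first retrieves the smallest candidate set, so each pass of steps (b) through (e) has fewer records to score.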

44. The method of claim 43, wherein step (c) determines the likelihood of relevance to the query for each record based on record linkage theory.

45. The method of claim 44, further comprising:
prior to step (c), checking a buffer of retrieved records for overflow and, if the buffer has overflowed, clearing the buffer and retrieving at least one record from the database, wherein each of the at least one record includes all of the plurality of tokens.
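The overflow handling can be sketched as follows, under the assumption that the buffer is a list of record identifiers with a fixed capacity and that "includes all of the plurality of tokens" corresponds to intersecting the pointer lists of every token. The names `retrieve_with_overflow_guard` and `buffer_limit` are hypothetical.

```python
def retrieve_with_overflow_guard(token, all_tokens, index, buffer_limit):
    """Retrieve records for one token, falling back to a joint search on overflow."""
    buffer = list(index.get(token, []))
    if len(buffer) > buffer_limit:              # overflow check before scoring
        buffer.clear()                          # clear the buffer
        pointer_sets = [set(index.get(tok, [])) for tok in all_tokens]
        # keep only records that contain every token
        joint = set.intersection(*pointer_sets) if pointer_sets else set()
        buffer = sorted(joint)
    return buffer
```

A joint search is far more selective than a single-token lookup, which is why it serves as the fallback when one token alone returns too many records.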

46. An apparatus for accessing data in a database, the apparatus comprising:
A token selection unit for selecting one token from a plurality of tokens as a first token to be searched;
A database access unit that is in signal communication with the token selection unit and retrieves at least one record from the database according to the selected token;
A relevance determination unit in signal communication with the database access unit for determining, for each of the at least one record, a likelihood of relevance to a query;
A relevance comparison unit that is in signal communication with the relevance determination unit and orders the at least one record according to the likelihood of relevance to the query;
A threshold comparison unit that is in signal communication with the relevance comparison unit and the token selection unit, compares a continuation threshold with the highest likelihood of relevance to the query among the at least one record, and either terminates the search if the continuation threshold is exceeded or, if the continuation threshold is not exceeded, removes the selected token from the plurality of tokens and inputs the remaining tokens to the token selection unit; and
An output device that is in signal communication with the threshold comparison unit and returns the at least one retrieved record when the threshold comparison unit terminates the search.

47. The apparatus of claim 46, wherein the relevance determination unit determines the likelihood of relevance to the query for each of the at least one record based on record linkage theory.

48. The apparatus of claim 47, wherein:
the database access unit retrieves a plurality of records from the database according to the selected token;
the relevance determination unit in signal communication with the database access unit determines the likelihood of relevance to the query for each of the plurality of records based on record linkage theory;
the relevance comparison unit in signal communication with the relevance determination unit orders the plurality of records according to the likelihood of relevance to the query;
the threshold comparison unit in signal communication with the relevance comparison unit and the token selection unit compares the continuation threshold with the highest likelihood of relevance to the query among the plurality of records, and either terminates the search if the continuation threshold is exceeded or, if the continuation threshold is not exceeded, removes the selected token from the plurality of tokens and inputs the remaining tokens to the token selection unit; and
the output device in signal communication with the threshold comparison unit returns the plurality of retrieved records, ordered by the likelihood of relevance to the query, when the threshold comparison unit terminates the search.

49. The apparatus of claim 46, further comprising:
A frequency comparison unit that determines a frequency of occurrence in the database for each of the plurality of tokens and orders the tokens according to the frequency of occurrence,
wherein the token selection unit is in signal communication with the frequency comparison unit and selects, from the plurality of tokens, the token having the lowest frequency of occurrence as the first token to be searched.

50. The apparatus of claim 49, wherein:
the frequency comparison unit determines, for each of the plurality of tokens, the frequency of occurrence within a domain in the database and orders the tokens by the frequency of occurrence within the respective domain associated with each token; and
the token selection unit selects, from the plurality of tokens, the token having the lowest frequency of occurrence in its respective domain as the first token to be searched.

51. The apparatus of claim 50, wherein the relevance determination unit determines the likelihood of relevance to the query for each of the at least one record based on record linkage theory.

52. The apparatus of claim 51, further comprising:
A buffer overflow prevention unit that is in signal communication with the database access unit, checks a buffer for overflow, and, if the buffer has overflowed, clears the buffer and sends an overflow signal to the token selection unit,
wherein the token selection unit, which is in signal communication with the buffer overflow prevention unit, selects all of the plurality of tokens as tokens to be searched jointly in response to the signal from the buffer overflow prevention unit, and
the database access unit retrieves at least one record from the database, wherein each of the at least one record includes all of the plurality of tokens.