JP3678615B2

JP3678615B2 - Document search apparatus and document search method

Info

Publication number: JP3678615B2
Application number: JP28830999A
Authority: JP
Inventors: 光昭稲葉; 祐司菅野
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1999-10-08
Filing date: 1999-10-08
Publication date: 2005-08-03
Anticipated expiration: 2019-10-08
Also published as: JP2001109766A

Description

【０００１】
【発明の属する技術分野】
本発明は、検索条件にしたがって所望の文書を検索する文書検索装置と文書検索方法に関し、特に、各文書が検索条件に合致する度合と、各文書に付随する書誌事項、例えば新聞記事ならば日付の新しい順などの組み合わせによって検索結果を並べ替えて表示できるようにしたものである。
【０００２】
【従来の技術】
近年、文書中における検索語の出現頻度等に基づいて、文書と検索条件との適合度を求め、その高い順に結果を並び替えて表示する、文書ランキングの手法が注目されてきている。さらに、文書に付随する書誌事項、例えば新聞記事であれば日付をソート条件として指定し、日付の新しい記事から優先して表示するが、同一日付の記事については検索条件との適合度の高い順に表示するといった、柔軟な検索が実現されてきている。
【０００３】
従来の文書検索装置は、図１３に示すように、検索対象となる新聞記事の文書データ1301から辞書1302に載る単語の単語頻度情報を抽出し、単語頻度索引1304に格納する単語頻度情報抽出手段1303と、文書データ1301から日付・紙名コードといった書誌事項の情報を取り出し、書誌事項索引1306に格納する書誌事項抽出手段1305と、ユーザが検索条件及びソート条件からなる検索要求文字列を入力するための検索要求入力手段1307と、単語頻度索引1304を調べて検索条件に含まれる検索語の文書中での出現頻度を求める単語頻度算定手段1308と、レコード集合間の論理演算を行う論理演算手段1309と、検索条件と各レコードとの適合度を算出する適合度算定手段1310と、ソート条件に指定された並べ替えのための書誌情報を取得するソート情報取得手段1311と、書誌情報と適合度とによって検索結果のレコードリストを並べ替える結果並べ替え手段1312と、検索結果を表示する結果表示手段1313とを備えている。
【０００４】
なお、単語頻度索引1304には、単語頻度情報抽出手段1303の抽出動作により、検索対象文書中の辞書単語の出現頻度が格納される。
【０００５】
図１４は、従来の文書検索装置における検索の処理手順を示すフローチャートである。文書データ1301は、レコード区切り文字で区切られた複数のレコード（文書）から成り、各レコードは、フィールド区切り文字で区切られた複数のフィールドから成っている。図３は文書データの具体例を示しており、フィールド区切り文字が「＾Ｆ」、レコード区切り文字が「＾Ｒ」で、紙名コード、日付、記事本文という３つのフィールドから成る新聞記事データである。
【０００６】
単語頻度情報抽出手段1303は、予め文書データ1301を走査し、辞書1302に登録されている単語が各レコードの記事本文フィールドに何回出現しているかをカウントし、当該単語が出現しているレコード数及び総レコード数とともに、単語頻度索引1304に格納する。
【０００７】
また、書誌事項抽出手段1305は、予め文書データ1301を走査し、各レコードの書誌事項フィールドの内容を書誌事項索引1306に格納する。
【０００８】
まず、
ステップ1401：ユーザは検索要求入力手段1307により、検索要求文字列を入力する。検索要求文字列は検索条件、ソート条件の２つの部分からなる。図１５は検索要求文字列の具体例を示しており、「松下ＡＮＤ新製品」の部分は検索条件で、「松下」と「新製品」という２つの検索語をともに記事本文に含むような記事を検索することを意味し、「＠ＨＩＤＵＫＥ＠ＳＨＩＭＥＩ」の部分はソート条件で、検索結果を日付の新しい順で並べ、同じ日付なら紙名コードの小さい順で並べるということを意味している。日付、紙名コードがどちらも同じ場合は適合度の順に並べる。
【０００９】
ステップ1402：単語頻度算定手段1308は、全ての検索語を対象として、
ステップ1403：単語頻度索引1304を参照し、検索要求入力手段1307によって入力された検索条件に含まれる検索語について、当該単語が記事本文に出現するレコード数と各レコードの内部番号、各レコードにおける当該単語の出現頻度及び総レコード数を算出する。
【００１０】
ステップ1404：論理演算手段1309は、単語頻度算定手段1308の出力したレコード集合間の論理演算を行う。
【００１１】
ステップ1405：適合度算定手段1310は、全ての検索結果レコードを対象として、
ステップ1406：論理演算手段1309の出力した各レコードについて、検索条件との適合度（Ｒel）を、たとえば（数１）によって算出する。
Ｒel ＝ Σ（ＴＦi・ＩＤＦi）（Σはｉについて加算）
ＩＤＦi ＝１−ｌｏｇ₂（ＤＦi／ＮＤ）（数１）
ただし、ＴＦiは検索語Ｗiのレコード内出現頻度、ＤＦiは語Ｗiの出現するレコード数、ＮＤは総レコード数を表す。
【００１２】
なお、適合度の算出方法は（数１）に限らない。
【００１３】
ステップ1407：ソート情報取得手段1311は、書誌事項索引1306を参照し、適合度算定手段1310の出力した各レコードの、検索要求入力手段1307から入力されたソート条件に対応する書誌事項の値をソート情報として取得する。
【００１４】
図６はソート情報取得手段1311の出力内容例を示しており、日付と紙面コードの値をソート情報として取得している。
【００１５】
ステップ1408：結果並べ替え手段1312は、ソート情報として取得した複数の書誌事項をソートキーとして、ソート情報取得手段1311の出力を並べ替えて出力する。このとき、すべての書誌事項の値が同じレコードがあった場合には適合度の大きい順に並べ替える。
【００１６】
図１６は、結果並べ替え手段1312の出力内容の例である。
【００１７】
ステップ1409：結果表示手段1313は、結果並べ替え手段1312の出力を整形してユーザに提示する。
【００１８】
【発明が解決しようとする課題】
しかし、従来の構成では、並べ替えのキーとして、適合度よりも、ソート条件に指定した書誌事項の値などが優先されるために、適合度の低い文書が上位に、適合度の高い文書が下位にランクされてしまうことがあり、所望の文書を効率良く探し出すことができないという問題点があった。
【００１９】
たとえば、図８において最下位にランクされている文書（レコード内部番号10）がこれに当たる。
【００２０】
本発明は、こうした従来技術の課題を解決するものであり、ソート条件に指定された書誌事項の値を並べ替えのキーとして重要視しながらも、ユーザが適合度の範囲を限定することができ、指定した適合度範囲に入らない文書を、より下位にランクすることで、所望の文書を効率良く探し出すことが可能な文書検索装置を提供し、また、その文書検索方法を提供することを目的としている。
【００２１】
【課題を解決するための手段】
そこで、本発明の文書検索装置では、検索要求文字列として、検索条件、ソート条件に加え、適合度範囲指定を入力する検索要求入力手段と、文書の適合度が、指定された適合度範囲に該当するかどうかにより、異なる区分けフラグを付与する中間結果区分け手段とを設けている。
【００２５】
また、本発明の文書検索方法では、検索条件と、ソート条件と、検索条件に合致する度合を示す適合度の範囲とを指定する検索要求に対して、蓄積された文書データから検索条件を満たす文書を検索し、検出した各文書の適合度を算出し、各文書をソート条件にしたがって並べ替えるための各文書のソート情報を取得し、各文書の適合度を検索要求で指定された適合度の範囲と比較して、その範囲に入るかどうかを示す区分けフラグを各文書に付与し、各文書を、まず、区分けフラグで並べ替え、区分けフラグの値が同一だった場合には、ソート情報で並べ替え、ソート情報が同一だった場合に適合度の順に並べ替えて表示するようにしている。
【００２６】
そのため、適合度がユーザの指定した範囲から外れる文書を、より下位にランク付けすることができ、ソート条件を指定した場合の、適合度の低い文書が上位に、適合度の高い文書が下位にランクされてしまうという問題を回避することができる。
【００２７】
【発明の実施の形態】
以下、本発明の実施の形態について、図を参照しながら説明する。
【００２８】
（第１の実施の形態）
図１は本発明の第１の実施形態における文書検索装置の構成を示したブロック図である。
【００２９】
この装置は、従来の装置（図１３）と同様に、検索対象となる新聞記事の文書データ101から辞書102に載る単語の単語頻度情報を抽出して単語頻度索引104に格納する単語頻度情報抽出手段103、文書データ101から日付・紙名コードといった書誌事項の情報を取り出して書誌事項索引106に格納する書誌事項抽出手段105、検索要求入力手段107、単語頻度算定手段108、論理演算手段109、適合度算定手段110、ソート情報取得手段112、結果並べ替え手段114、及び、結果表示手段115を備えるとともに、適合度算定手段110によって算定された各レコードの適合度を最大値に対する相対値へ変換してソート情報取得手段112に出力する相対適合度算定手段111と、ソート情報取得手段112から出力された検索結果から適合度の値が指定した適合度範囲に入らないレコードを除く中間結果足切り手段113とを備えている。
【００３０】
図２のフローチャートは、第１の実施形態における検索の処理手順を示している。文書データ101は、レコード区切り文字で区切られた複数のレコード（文書）から成り、各レコードは、フィールド区切り文字で区切られた複数のフィールドから成っている。図３は文書データの具体例であり、フィールド区切り文字が「＾Ｆ」、レコード区切り文字が「＾Ｒ」で、紙名コード、日付、記事本文という３つのフィールドから成る新聞記事データである。
【００３１】
単語頻度情報抽出手段103は、予め文書データ101を走査し、辞書102に登録されている単語が各レコードの記事本文フィールドに何回出現しているかをカウントし、当該単語が出現しているレコード数及び総レコード数とともに、単語頻度索引104に格納する。
【００３２】
また、書誌事項抽出手段105は、予め前記文書データ101を走査し、各レコードの書誌事項フィールドの内容を書誌事項索引106に格納する。
【００３３】
まず、
ステップ201：ユーザは検索要求入力手段107により、検索要求文字列を入力する。検索要求文字列は検索条件、ソート条件、適合度範囲指定の３つの部分からなる。図４は検索要求文字列の具体例を示しており、「松下ＡＮＤ新製品」の部分は検索条件で、「松下」と「新製品」という２つの検索語をともに記事本文に含むような記事を検索することを意味し、「＠ＨＩＤＵＫＥ＠ＳＨＩＭＥＩ」の部分はソート条件で、検索結果を日付の新しい順で並べ、同じ日付なら紙名コードの小さい順で並べるということを意味し、「＄７０：」の部分は適合度範囲指定で、適合度が最大である記事に対する相対適合度が７０以上である記事だけを結果に含めることを意味している。日付、紙名コードがどちらも同じ場合は適合度の順に並べる。なお、「＄７０：９０」のように適合度範囲指定の下限と上限とを両方指定して、適合度が７０以上９０以下の記事を結果に含めるといった指定や、上限だけを指定することも可能である。
【００３４】
ステップ202：単語頻度算定手段108は、全ての検索語を対象として、
ステップ203：単語頻度索引104を参照し、検索要求入力手段107によって入力された検索条件に含まれる検索語について、当該単語が記事本文に出現するレコード数と各レコードの内部番号、各レコードにおける当該単語の出現頻度、及び総レコード数を算出する。
【００３５】
ステップ204：論理演算手段109は、単語頻度算定手段108の出力したレコード集合間の論理演算を行う。図５は図４に示した検索要求文字列の場合の論理演算手段109の出力内容例を示しており、「松下」と「新製品」がともに出現するレコード集合が求められている。
【００３６】
ステップ205：適合度算定手段110は、全ての検索結果レコードを対象として、ステップ206：論理演算手段109の出力した各レコードについて、検索条件との適合度を、例えば、前記（数１）によって算出する。
【００３７】
ステップ207：相対適合度算定手段111は、適合度算定手段110の出力した各レコードの適合度を、それらの最大値で除して１００倍した値に変換する。
【００３８】
ステップ208：ソート情報取得手段112は、検索要求入力手段107で入力されたソート条件にしたがって書誌事項索引106を参照し、相対適合度算定手段111の出力した各レコードの、書誌事項の値をソート情報として取得する。図６はソート情報取得手段112の出力内容例で、日付と紙面コードの値をソート情報として取得している。
【００３９】
ステップ209：中間結果足切り手段113は、ソート情報取得手段112から出力される全てのレコードを対象にして、
ステップ210：そのレコードの適合度が検索要求入力手段107から入力された適合度範囲指定に該当しているかをチェックし、
ステップ211：該当していないレコードは、除外する。
【００４０】
図７は、適合度範囲指定が７０以上の場合に中間結果足切り手段113から出力される内容の例である。
【００４１】
ステップ212：結果並べ替え手段114は、ソート情報として取得した複数の書誌事項をソートキーにして、中間結果足切り手段113の出力を並べ替え、全ての書誌事項の値が同じレコードの場合には適合度の大きい順に並べ替えて出力する。図８は、この結果並べ替え手段114の出力内容の例である。日付が新しく、紙名コードの小さい順に結果文書が並べられ、かつ、適合度が指定した範囲外だった記事は除外されているため、ユーザは効率良く所望の文書を見つけることができる。
【００４２】
ステップ213：結果出力手段115は、結果並べ替え手段114の出力を整形してユーザに提示する。
【００４３】
このように、この文書検索装置では、検索した文書の中から適合度範囲に入らない文書を除いて表示することができるため、所望の文書を効率よく探し出すことができる。
【００４４】
また、検索結果の文書を適合度で足切りする場合に、検索結果を一旦適合度でソートし、適合度が所定値に満たない文書を足切りする方法も考えられるが、足切り前の検索結果の文書数は多いため、この文書を対象とするソートの処理負担は極めて重くなる。これに対して、この実施形態の方法では、文書の適合度が、指定された適合度範囲に入るかどうかのチェックを、各文書に対して行うだけであるから、前記ソート処理に比べて軽い処理になる。従って、文書検索結果を迅速に表示することができる。
【００４５】
なお、ステップ208のソート情報の取得は、ステップ209のＹＥＳの後、即ち、検索結果の足切りをした後の文書を対象に行うようにしても良く、そうした場合には、ソート情報の取得の作業量を減らすことができる。
【００４６】
（第２の実施の形態）
第２の実施形態では、適合度のランクで区別して文書を表示する文書検索装置について説明する。
【００４７】
この装置は、図９に示すように、ソート情報取得手段912から出力された検索結果のレコードに対して、適合度の値が指定された適合度範囲に入るかどうかによって異なる区分けフラグを付与する中間結果区分け手段913を備えている。また、第１の実施形態と異なり、中間結果足切り手段は持たない。その他の構成は、第１の実施形態（図１）と変わりがない。
【００４８】
図１０は、第２の実施形態における、検索の処理手順を示すフローチャートである。ここで、ステップ1008までの手順は、第１の実施形態と同様の処理手順である。
【００４９】
ステップ1009：中間結果区分け手段913は、ソート情報取得手段912から出力される全てのレコードを対象にして、
ステップ1010：そのレコードの適合度が検索要求入力手段907から入力された適合度範囲指定に該当しているかをチェックし、
ステップ1011：適合度範囲に該当しないレコードについては区分けフラグの値として「２」を付与し、
ステップ1012：適合度範囲に該当するレコードについては区分けフラグの値として「１」を付与する。
【００５０】
図１１は、中間結果区分け手段913の出力内容の例である。
【００５１】
なお、適合度範囲として下限と上限の両方が指定された場合には、中間結果区分け手段913が、適合度範囲に該当しないレコードをさらに細分化して、上限を超えるレコードには区分けフラグの値として「２」を、下限に満たないレコードには区分けフラグの値として「３」を与えるようにしても良い。
【００５２】
ステップ1013：結果並べ替え手段914は、中間結果区分け手段913の出力を、区分けフラグの値の降順で並べ替え、区分けフラグの値が同じだった場合には、ソート情報として取得した複数の書誌事項をソートキーとして並べ替え、すべての書誌事項の値が同じレコードがあった場合には適合度の大きい順に並べ替えて出力する。
【００５３】
図１２は、結果並べ替え手段914の出力内容の例である。日付が新しく、紙名コードの小さい順に結果文書が並べられ、かつ、適合度が指定した範囲外だった記事は、適合度が指定範囲内にある記事群よりも下位にランクされるため、ユーザは効率良く所望の文書を見つけることができる。
【００５４】
ステップ1014：結果出力手段915は、結果並べ替え手段914の出力を整形してユーザに提示する。
【００５５】
このように、この実施形態の文書検索装置では、検索された全ての文書を、適合度範囲に入るものと入らないものとに区分して表示することができる。ユーザは、検索の目的に応じて、適合度範囲に該当する区分の文書だけを見て文書検索を終了することもできるし、特許文書を検索するときのように、１つの漏れも許されない場合には、適合度範囲から外れる区分の文書についても逐一調べることが可能である。
【００５６】
【発明の効果】
以上の説明から明らかなように、本発明の文書検索装置及び文書検索方法では、適合度がユーザの指定した範囲から外れる文書を、より下位にランク付けすることができる。
【００５７】
そうすることにより、ソート条件を指定した場合の、適合度の低い文書が上位に、適合度の高い文書が下位にランクされてしまうという問題を回避でき、所望の文書を効率良く検索することが可能になる。
【００５８】
また、各文書の適合度を最大値に対する相対値に変換し、検索要求における適合度範囲指定も相対値で指定することにより、適切な適合度範囲を容易に指定できる。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態における文書検索装置の構成を示すブロック図、
【図２】第１の実施の形態における検索処理の手順を示す流れ図、
【図３】文書データの一例を示す図、
【図４】第１の実施形態における検索要求文字列の一例を示す図、
【図５】第１の実施形態における論理演算手段の出力内容の一例を示す図、
【図６】第１の実施形態におけるソート情報取得手段の出力内容の一例を示す図、
【図７】第１の実施形態における中間結果足切り手段の出力内容の一例を示す図、
【図８】第１の実施形態における結果並べ替え手段の出力内容の一例を示す図、
【図９】本発明の第２の実施の形態における文書検索装置の構成を示すブロック図、
【図１０】第２の実施の形態における検索処理の手順を示す流れ図、
【図１１】第２の実施形態における中間結果区分け手段の出力内容の一例を示す図、
【図１２】第２の実施形態における結果並べ替え手段の出力内容の一例を示す図、
【図１３】従来の文書検索装置の構成を示すブロック図、
【図１４】従来の検索処理の手順を示す流れ図、
【図１５】検索要求文字列の一例を示す図、
【図１６】結果並べ替え手段の出力内容の一例を示す図である。
【符号の説明】
101、901、1301 文書データ
102、902、1302 辞書
103、903、1303 単語頻度情報抽出手段
104、904、1304 単語頻度索引
105、905、1305 書誌事項抽出手段
106、906、1306 書誌事項索引
107、907、1307 検索要求入力手段
108、908、1308 単語頻度算定手段
109、909、1309 論理演算手段
110、910、1310 適合度算定手段
111、911 相対適合度算定手段
112、912、1311 ソート情報取得手段
113 中間結果足切り手段
114、914、1312 結果並べ替え手段
115、915、1313 結果表示手段
913 中間結果区分け手段

[0001]
[Technical field of the invention]
The present invention relates to a document search apparatus and a document search method that search for desired documents according to a search condition. In particular, it allows search results to be sorted and displayed by a combination of the degree to which each document matches the search condition and the bibliographic items attached to each document, for example newest date first in the case of newspaper articles.
[0002]
[Prior art]
In recent years, document ranking methods have attracted attention. In these methods, the degree of match (fitness) between a document and the search condition is obtained from, for example, the frequency with which the search terms appear in the document, and the results are sorted and displayed in descending order of fitness. Flexible searches have also been realized in which a bibliographic item attached to the documents, such as the date of a newspaper article, is specified as a sort condition, so that articles with newer dates are displayed first and articles with the same date are displayed in descending order of fitness with respect to the search condition.
[0003]
As shown in FIG. 13, the conventional document search apparatus comprises: word frequency information extraction means 1303, which extracts word frequency information for words listed in a dictionary 1302 from document data 1301 of the newspaper articles to be searched and stores it in a word frequency index 1304; bibliographic item extraction means 1305, which extracts bibliographic information such as the date and paper name code from the document data 1301 and stores it in a bibliographic item index 1306; search request input means 1307, with which the user inputs a search request character string consisting of a search condition and a sort condition; word frequency calculation means 1308, which consults the word frequency index 1304 to obtain the frequency with which each search term in the search condition appears in each document; logical operation means 1309, which performs logical operations between record sets; fitness calculation means 1310, which calculates the fitness between the search condition and each record; sort information acquisition means 1311, which acquires the bibliographic information needed for the sorting specified in the sort condition; result sorting means 1312, which sorts the record list of search results by bibliographic information and fitness; and result display means 1313, which displays the search results.
[0004]
The word frequency index 1304 stores the frequency with which dictionary words appear in the documents to be searched, as extracted by the word frequency information extraction means 1303.
[0005]
FIG. 14 is a flowchart showing the search processing procedure in the conventional document search apparatus. The document data 1301 consists of a plurality of records (documents) separated by a record delimiter, and each record consists of a plurality of fields separated by a field delimiter. FIG. 3 shows a specific example of the document data: newspaper article data in which the field delimiter is "^F", the record delimiter is "^R", and each record consists of three fields, namely the paper name code, the date, and the article body.
[0006]
The word frequency information extraction means 1303 scans the document data 1301 in advance, counts how many times each word registered in the dictionary 1302 appears in the article body field of each record, and stores the counts in the word frequency index 1304 together with the number of records in which the word appears and the total number of records.
[0007]
The bibliographic item extracting unit 1305 scans the document data 1301 in advance and stores the contents of the bibliographic item field of each record in the bibliographic item index 1306.
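
The two pre-processing passes described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the delimiters are treated as the literal two-character strings "^F" and "^R", words are counted by plain substring matching, and the function names `parse_records` and `build_indexes` and the dictionary layouts are hypothetical, not taken from the patent.

```python
from collections import defaultdict

FIELD_SEP = "^F"   # assumed literal field delimiter
RECORD_SEP = "^R"  # assumed literal record delimiter


def parse_records(data):
    """Split raw document data into (paper_code, date, body) tuples."""
    records = []
    for raw in data.split(RECORD_SEP):
        if not raw.strip():
            continue  # skip the empty piece after the final record delimiter
        paper_code, date, body = raw.strip().split(FIELD_SEP)
        records.append((paper_code, date, body))
    return records


def build_indexes(records, dictionary):
    """Return (word_freq_index, biblio_index, total_records).

    word_freq_index[word][record_no] is the count of `word` in that body;
    len(word_freq_index[word]) is the number of records containing the word.
    """
    word_freq = defaultdict(dict)
    biblio = {}
    for no, (paper_code, date, body) in enumerate(records, start=1):
        biblio[no] = {"SHIMEI": paper_code, "HIDUKE": date}
        for word in dictionary:
            count = body.count(word)  # simple substring count (assumption)
            if count:
                word_freq[word][no] = count
    return dict(word_freq), biblio, len(records)


sample = (
    "01^F19990801^FMatsushita shows a new product. Matsushita is hiring.^R"
    "02^F19990802^FA new product from another maker.^R"
)
records = parse_records(sample)
word_freq, biblio, total = build_indexes(records, ["Matsushita", "new product"])
print(word_freq["Matsushita"])  # {1: 2}
print(biblio[2])
```

Record numbers here start at 1 and stand in for the "internal numbers" mentioned in the text; a real index would also need proper word segmentation for Japanese text.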
[0008]
First,
Step 1401: The user inputs a search request character string using the search request input means 1307. The search request character string consists of two parts, a search condition and a sort condition. FIG. 15 shows a specific example. The part "Matsushita AND new product" is the search condition and means: search for articles whose body contains both of the two search terms "Matsushita" and "new product". The part "@HIDUKE@SHIMEI" is the sort condition and means: arrange the search results newest date first and, for the same date, in ascending order of paper name code. Records with the same date and the same paper name code are arranged in order of fitness.
[0009]
Step 1402: For every search term, the word frequency calculation means 1308 performs the following.
Step 1403: Referring to the word frequency index 1304, it obtains, for each search term included in the search condition input via the search request input means 1307, the number of records in whose article body the term appears, the internal number of each such record, the frequency of the term in each record, and the total number of records.
[0010]
Step 1404: The logical operation means 1309 performs a logical operation between the record sets output by the word frequency calculation means 1308.
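
For an AND condition, the logical operation in step 1404 amounts to intersecting the record sets found for each search term. A minimal sketch, assuming the word frequency index is a mapping from word to `{record_no: count}` (a hypothetical layout, as is the function name):

```python
def records_matching_all(word_freq_index, terms):
    """Return the internal numbers of records that contain every search term."""
    record_sets = [set(word_freq_index.get(term, {})) for term in terms]
    if not record_sets:
        return set()
    return set.intersection(*record_sets)


# Hypothetical index contents: word -> {record_no: occurrences in body}
word_freq_index = {
    "Matsushita": {1: 2, 3: 1, 4: 1},
    "new product": {1: 1, 2: 1, 4: 3},
}
hits = records_matching_all(word_freq_index, ["Matsushita", "new product"])
print(sorted(hits))  # [1, 4]
```

An OR condition would use a union of the same sets instead.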
[0011]
Step 1405: For every search result record, the fitness calculation means 1310 performs the following.
Step 1406: For each record output by the logical operation means 1309, the fitness (Rel) with respect to the search condition is calculated by, for example, (Equation 1).
$$\mathrm{Rel} = \sum_i \mathrm{TF}_i \cdot \mathrm{IDF}_i, \qquad \mathrm{IDF}_i = 1 - \log_2\frac{\mathrm{DF}_i}{\mathrm{ND}} \qquad \text{(Equation 1)}$$
Here, $\mathrm{TF}_i$ is the frequency of search term $W_i$ within the record, $\mathrm{DF}_i$ is the number of records in which $W_i$ appears, and $\mathrm{ND}$ is the total number of records.
[0012]
Note that the method of calculating the fitness is not limited to (Equation 1).
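
As a worked illustration of (Equation 1), the sketch below computes Rel for one record. The function name `fitness` and the sample figures are made up for the example; only the formula itself comes from the text.

```python
import math


def fitness(tf_by_term, df_by_term, total_records):
    """Rel = sum over search terms of TF_i * (1 - log2(DF_i / ND))."""
    rel = 0.0
    for term, tf in tf_by_term.items():
        idf = 1 - math.log2(df_by_term[term] / total_records)
        rel += tf * idf
    return rel


# Hypothetical figures: 8 records in total (ND = 8).
# "Matsushita": TF = 3, DF = 2 -> IDF = 1 - log2(2/8) = 3
# "new product": TF = 1, DF = 4 -> IDF = 1 - log2(4/8) = 2
rel = fitness(
    tf_by_term={"Matsushita": 3, "new product": 1},
    df_by_term={"Matsushita": 2, "new product": 4},
    total_records=8,
)
print(rel)  # 11.0
```

A term that appears in every record still contributes, because its IDF is 1 rather than 0 under this formula.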
[0013]
Step 1407: The sort information acquisition means 1311 refers to the bibliographic item index 1306 and, for each record output by the fitness calculation means 1310, acquires as sort information the values of the bibliographic items that correspond to the sort condition input via the search request input means 1307.
[0014]
FIG. 6 shows an example of the output contents of the sort information acquisition means 1311. The date and the page code value are acquired as sort information.
[0015]
Step 1408: The result sorting means 1312 sorts and outputs the output of the sort information obtaining means 1311 using a plurality of bibliographic items obtained as sort information as a sort key. At this time, if there are records having the same value of all the bibliographic items, the records are rearranged in descending order of fitness.
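
The conventional ordering in step 1408 can be sketched as a single multi-key sort. The record layout (`date` as a YYYYMMDD string, `paper` as a code string, `fitness` as a number) and the sample values are assumptions for illustration only.

```python
def conventional_sort(records):
    """Newest date first, then paper code ascending, then fitness descending."""
    return sorted(
        records,
        key=lambda r: (-int(r["date"]), r["paper"], -r["fitness"]),
    )


# Hypothetical search results
results = [
    {"no": 1, "date": "19990801", "paper": "01", "fitness": 95.0},
    {"no": 2, "date": "19990803", "paper": "02", "fitness": 12.0},
    {"no": 3, "date": "19990803", "paper": "01", "fitness": 40.0},
    {"no": 4, "date": "19990803", "paper": "01", "fitness": 60.0},
]
print([r["no"] for r in conventional_sort(results)])  # [4, 3, 2, 1]
```

In this made-up data, record 1 has by far the highest fitness but ends up last because its date is the oldest, which is the kind of outcome the invention sets out to address.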
[0016]
FIG. 16 is an example of the output contents of the result rearranging means 1312.
[0017]
Step 1409: The result display means 1313 shapes the output of the result rearranging means 1312 and presents it to the user.
[0018]
[Problems to be solved by the invention]
However, in the conventional configuration, the values of the bibliographic items specified in the sort condition take priority over fitness as sort keys. As a result, documents with low fitness may be ranked high and documents with high fitness may be ranked low, so desired documents cannot be found efficiently.
[0019]
For example, this corresponds to the document ranked at the bottom in FIG. 8 (record internal number 10).
[0020]
The present invention solves this problem of the prior art. Its object is to provide a document search apparatus, and a corresponding document search method, that still treats the values of the bibliographic items specified in the sort condition as important sort keys while letting the user limit the fitness range, and that ranks documents falling outside the specified fitness range lower, so that desired documents can be found efficiently.
[0021]
[Means for Solving the Problems]
Therefore, the document search apparatus of the present invention is provided with search request input means for inputting, as the search request character string, a fitness range specification in addition to the search condition and the sort condition, and with intermediate result classification means for assigning different classification flags depending on whether the fitness of a document falls within the specified fitness range.
[0025]
In the document search method of the present invention, in response to a search request that specifies a search condition, a sort condition, and a range of fitness (the degree of match with the search condition), documents satisfying the search condition are retrieved from the stored document data; the fitness of each retrieved document is calculated; sort information for sorting each document according to the sort condition is acquired; the fitness of each document is compared with the fitness range specified in the search request, and a classification flag indicating whether it falls within that range is assigned to each document; and the documents are displayed sorted first by classification flag, then by sort information where the classification flag values are the same, and then in order of fitness where the sort information is also the same.
[0026]
As a result, documents whose fitness falls outside the range specified by the user can be ranked lower, which avoids the problem that arises when a sort condition is specified, namely that documents with low fitness are ranked high and documents with high fitness are ranked low.
[0027]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0028]
(First embodiment)
FIG. 1 is a block diagram showing a configuration of a document search apparatus according to the first embodiment of the present invention.
[0029]
Like the conventional apparatus (FIG. 13), this apparatus comprises: word frequency information extraction means 103, which extracts word frequency information for words listed in a dictionary 102 from document data 101 of the newspaper articles to be searched and stores it in a word frequency index 104; bibliographic item extraction means 105, which extracts bibliographic information such as the date and paper name code from the document data 101 and stores it in a bibliographic item index 106; search request input means 107; word frequency calculation means 108; logical operation means 109; fitness calculation means 110; sort information acquisition means 112; result sorting means 114; and result display means 115. In addition, it comprises relative fitness calculation means 111, which converts the fitness of each record calculated by the fitness calculation means 110 into a value relative to the maximum and outputs it to the sort information acquisition means 112, and intermediate result cut-off means 113, which removes from the search results output by the sort information acquisition means 112 those records whose fitness falls outside the specified fitness range.
[0030]
The flowchart of FIG. 2 shows the search processing procedure in the first embodiment. The document data 101 consists of a plurality of records (documents) separated by a record delimiter, and each record consists of a plurality of fields separated by a field delimiter. FIG. 3 shows a specific example of the document data: newspaper article data in which the field delimiter is "^F", the record delimiter is "^R", and each record consists of three fields, namely the paper name code, the date, and the article body.
[0031]
The word frequency information extraction means 103 scans the document data 101 in advance, counts how many times each word registered in the dictionary 102 appears in the article body field of each record, and stores the counts in the word frequency index 104 together with the number of records in which the word appears and the total number of records.
[0032]
Further, the bibliographic item extracting means 105 scans the document data 101 in advance and stores the contents of the bibliographic item field of each record in the bibliographic item index 106.
[0033]
First,
Step 201: The user inputs a search request character string using the search request input means 107. The search request character string consists of three parts: a search condition, a sort condition, and a fitness range specification. FIG. 4 shows a specific example. The part "Matsushita AND new product" is the search condition and means: search for articles whose body contains both of the two search terms "Matsushita" and "new product". The part "@HIDUKE@SHIMEI" is the sort condition and means: arrange the search results newest date first and, for the same date, in ascending order of paper name code. The part "$70:" is the fitness range specification and means: include in the results only articles whose fitness relative to the article with the highest fitness is 70 or more. Records with the same date and the same paper name code are arranged in order of fitness. It is also possible to specify both a lower limit and an upper limit, as in "$70:90", so that articles with a fitness of 70 or more and 90 or less are included in the results, or to specify only an upper limit.
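
One way to split such a request into its three parts is sketched below. The exact syntax of FIG. 4 is not reproduced in the text, so the grammar here is an assumption: the search condition comes first, each sort key is introduced by "@", the optional range is written as "$lower:upper" with either bound omissible, and terms are joined by " AND " surrounded by spaces. The function name `parse_request` is hypothetical.

```python
import re

REQUEST_RE = re.compile(
    r"(?P<cond>[^@$]+)"                    # search condition
    r"(?P<sort>(?:@\w+\s*)*)"              # zero or more @KEY sort items
    r"(?:\$(?P<lo>\d*):(?P<hi>\d*))?\s*"   # optional $lower:upper range
)


def parse_request(request):
    """Return (terms, sort_keys, (lower, upper)) for a request string."""
    m = REQUEST_RE.fullmatch(request)
    if m is None:
        raise ValueError(f"unrecognised request: {request!r}")
    terms = re.split(r"\s+AND\s+", m.group("cond").strip())
    sort_keys = re.findall(r"@(\w+)", m.group("sort"))
    lower = float(m.group("lo")) if m.group("lo") else None
    upper = float(m.group("hi")) if m.group("hi") else None
    return terms, sort_keys, (lower, upper)


print(parse_request("Matsushita AND new product@HIDUKE@SHIMEI$70:"))
# (['Matsushita', 'new product'], ['HIDUKE', 'SHIMEI'], (70.0, None))
```

A missing bound is returned as `None`, so "$70:", "$:90" and "$70:90" all map onto the same (lower, upper) pair.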
[0034]
Step 202: For every search term, the word frequency calculation means 108 performs the following.
Step 203: Referring to the word frequency index 104, it obtains, for each search term included in the search condition input via the search request input means 107, the number of records in whose article body the term appears, the internal number of each such record, the frequency of the term in each record, and the total number of records.
[0035]
Step 204: The logical operation means 109 performs a logical operation between the record sets output by the word frequency calculation means 108. FIG. 5 shows an example of output contents of the logical operation means 109 in the case of the search request character string shown in FIG. 4, and a record set in which both “Matsushita” and “new product” appear is obtained.
[0036]
Step 205: For every search result record, the fitness calculation means 110 performs the following.
Step 206: For each record output by the logical operation means 109, the fitness with respect to the search condition is calculated by, for example, (Equation 1) above.
[0037]
Step 207: The relative fitness calculation means 111 converts the fitness of each record output by the fitness calculation means 110 into the value obtained by dividing it by the maximum of those fitness values and multiplying by 100.
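
The conversion in step 207 is a simple rescaling. A minimal sketch follows; the function name, the mapping layout, and the handling of an empty or all-zero result set are assumptions not covered by the text.

```python
def to_relative(fitness_by_record):
    """Rescale fitness values so that the best record scores 100."""
    if not fitness_by_record:
        return {}
    top = max(fitness_by_record.values())
    if top == 0:
        return {no: 0.0 for no in fitness_by_record}  # assumed fallback
    return {no: 100.0 * f / top for no, f in fitness_by_record.items()}


relative = to_relative({1: 11.0, 2: 5.5, 3: 2.75})
print(relative)  # {1: 100.0, 2: 50.0, 3: 25.0}
```

Because the scale is always 0 to 100, a range such as "$70:" keeps its meaning regardless of how large the raw fitness values are for a particular query.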
[0038]
Step 208: The sort information acquisition means 112 refers to the bibliographic item index 106 according to the sort condition input via the search request input means 107 and, for each record output by the relative fitness calculation means 111, acquires the values of the bibliographic items as sort information. FIG. 6 shows an example of the output of the sort information acquisition means 112, in which the date and the page code value are acquired as sort information.
[0039]
Step 209: For every record output from the sort information acquisition means 112, the intermediate result cut-off means 113 performs the following.
Step 210: It checks whether the fitness of the record falls within the fitness range specification input via the search request input means 107.
Step 211: Records that do not fall within the range are excluded.
[0040]
FIG. 7 is an example of contents output from the intermediate result cut-off means 113 when the fitness range designation is 70 or more.
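
Steps 209 to 211 amount to a single pass that keeps only in-range records. A minimal sketch, assuming inclusive bounds (as in "70 or more and 90 or less") and a hypothetical record layout with a `relative_fitness` field:

```python
def in_range(value, lower=None, upper=None):
    """True if value lies within the inclusive range; None means unbounded."""
    return (lower is None or value >= lower) and (upper is None or value <= upper)


def cut_off(records, lower=None, upper=None):
    """Drop records whose relative fitness is outside the specified range."""
    return [r for r in records if in_range(r["relative_fitness"], lower, upper)]


# Hypothetical intermediate results
candidates = [
    {"no": 1, "relative_fitness": 100.0},
    {"no": 2, "relative_fitness": 65.0},
    {"no": 3, "relative_fitness": 70.0},
    {"no": 4, "relative_fitness": 20.0},
]
print([r["no"] for r in cut_off(candidates, lower=70)])  # [1, 3]
```

The filter touches each record once and preserves input order, so no sort by fitness is needed before the cut-off.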
[0041]
Step 212: The result sorting means 114 sorts the output of the intermediate result cut-off means 113 using the bibliographic items acquired as sort information as sort keys; records for which all bibliographic item values are the same are sorted in descending order of fitness. FIG. 8 shows an example of the output of the result sorting means 114. The resulting documents are arranged newest date first and in ascending order of paper name code, and articles whose fitness is outside the specified range are excluded, so the user can find the desired documents efficiently.
[0042]
Step 213: The result display means 115 formats the output of the result sorting means 114 and presents it to the user.
[0043]
As described above, this document search apparatus can display the retrieved documents with those that fall outside the fitness range removed, so desired documents can be found efficiently.
[0044]
When cutting off search result documents by fitness, one could also sort the search results by fitness first and then cut off the documents whose fitness is below a given value. However, the number of documents before the cut-off is large, so sorting them imposes an extremely heavy processing load. In contrast, the method of this embodiment only checks, for each document, whether its fitness falls within the specified fitness range, which is light compared with such a sort. The document search results can therefore be displayed quickly.
[0045]
Note that the acquisition of sort information in step 208 may instead be performed after YES in step 209, that is, on the documents that remain after the search results have been cut off. Doing so reduces the amount of work needed to acquire sort information.
[0046]
(Second Embodiment)
In the second embodiment, a document search apparatus that displays documents by distinguishing them according to the rank of fitness will be described.
[0047]
As shown in FIG. 9, this apparatus includes intermediate result classification means 913, which assigns different classification flags to the search result records output from the sort information acquisition means 912 depending on whether the fitness value falls within the specified fitness range. Unlike the first embodiment, it has no intermediate result cut-off means. The rest of the configuration is the same as in the first embodiment (FIG. 1).
[0048]
FIG. 10 is a flowchart illustrating a search processing procedure according to the second embodiment. Here, the procedure up to step 1008 is the same processing procedure as in the first embodiment.
[0049]
Step 1009: For every record output from the sort information acquisition means 912, the intermediate result classification means 913 performs the following.
Step 1010: It checks whether the fitness of the record falls within the fitness range specification input via the search request input means 907.
Step 1011: For records that do not fall within the fitness range, “2” is assigned as the value of the classification flag,
Step 1012: “1” is assigned as the value of the classification flag for the record corresponding to the fitness range.
[0050]
FIG. 11 shows an example of the output contents of the intermediate result classifying means 913.
[0051]
When both a lower limit and an upper limit are specified as the fitness range, the intermediate result classification means 913 may further subdivide the records that fall outside the range, assigning a classification flag value of "2" to records above the upper limit and "3" to records below the lower limit.
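
The flag assignment of steps 1010 to 1012, together with the optional three-way subdivision, can be sketched as below. The function name and the `subdivide` switch are hypothetical; the flag values 1, 2 and 3 follow the text, and inclusive bounds are assumed.

```python
def classification_flag(value, lower=None, upper=None, subdivide=False):
    """Return 1 for in-range records and 2 (or 3) for out-of-range records.

    With subdivide=True and both bounds given, records above the upper
    limit get 2 and records below the lower limit get 3.
    """
    if lower is not None and value < lower:
        return 3 if (subdivide and upper is not None) else 2
    if upper is not None and value > upper:
        return 2
    return 1


print(classification_flag(80, lower=70))                            # 1
print(classification_flag(50, lower=70))                            # 2
print(classification_flag(95, lower=70, upper=90, subdivide=True))  # 2
print(classification_flag(50, lower=70, upper=90, subdivide=True))  # 3
```

Unlike the cut-off of the first embodiment, no record is discarded here; the flag only records which group it belongs to.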
[0052]
Step 1013: The result sorting means 914 sorts the output of the intermediate result classification means 913 in descending order of classification flag value; where classification flag values are the same, it sorts using the bibliographic items acquired as sort information as sort keys; and where all bibliographic item values are also the same, it sorts in descending order of fitness. It then outputs the result.
[0053]
FIG. 12 shows an example of the output of the result sorting means 914. The resulting documents are arranged newest date first and in ascending order of paper name code, and articles whose fitness is outside the specified range are ranked below the group of articles whose fitness is within the range, so the user can find the desired documents efficiently.
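
The three-level ordering of the second embodiment can be sketched as one multi-key sort. Note one assumption: for in-range records (flag 1) to come before out-of-range records (flag 2), as the description of FIG. 12 requires, this sketch places smaller flag values first. The record layout and sample values are hypothetical.

```python
def sort_with_flags(records):
    """Order by flag (in-range group first), then newest date, then
    paper code ascending, then relative fitness descending."""
    return sorted(
        records,
        key=lambda r: (
            r["flag"],
            -int(r["date"]),
            r["paper"],
            -r["relative_fitness"],
        ),
    )


# Hypothetical classified results for a range of "70 or more"
classified = [
    {"no": 1, "date": "19990801", "paper": "01", "relative_fitness": 100.0, "flag": 1},
    {"no": 2, "date": "19990803", "paper": "02", "relative_fitness": 12.6, "flag": 2},
    {"no": 3, "date": "19990803", "paper": "01", "relative_fitness": 42.1, "flag": 2},
    {"no": 4, "date": "19990803", "paper": "01", "relative_fitness": 84.2, "flag": 1},
]
print([r["no"] for r in sort_with_flags(classified)])  # [4, 1, 3, 2]
```

Within each flag group the date and paper code still decide the order, so the user's sort condition is respected while low-fitness records no longer push ahead of in-range ones.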
[0054]
Step 1014: The result display means 915 formats the output of the result sorting means 914 and presents it to the user.
[0055]
As described above, the document search apparatus of this embodiment can display all retrieved documents divided into those that fall within the fitness range and those that do not. Depending on the purpose of the search, the user can finish after looking only at the documents in the in-range category or, when not a single omission is acceptable, as when searching patent documents, can also examine the out-of-range documents one by one.
[0056]
[Effect of the invention]
As is clear from the above description, the document search apparatus and document search method of the present invention can rank documents whose fitness falls outside the range specified by the user lower.
[0057]
This avoids the problem that arises when a sort condition is specified, namely that documents with low fitness are ranked high and documents with high fitness are ranked low, and makes it possible to retrieve desired documents efficiently.
[0058]
In addition, by converting the fitness of each document into a relative value with respect to the maximum value and specifying the fitness level range in the search request as a relative value, an appropriate fitness level range can be easily specified.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a document search apparatus according to a first embodiment of the present invention;
FIG. 2 is a flowchart showing a search processing procedure according to the first embodiment;
FIG. 3 is a diagram showing an example of document data;
FIG. 4 is a view showing an example of a search request character string in the first embodiment;
FIG. 5 is a diagram showing an example of output contents of the logical operation means in the first embodiment;
FIG. 6 is a diagram showing an example of output contents of sort information acquisition means in the first embodiment;
FIG. 7 is a view showing an example of output contents of an intermediate result cut-off means in the first embodiment;
FIG. 8 is a diagram showing an example of output contents of a result rearranging unit in the first embodiment;
FIG. 9 is a block diagram showing a configuration of a document search device according to the second embodiment of the present invention;
FIG. 10 is a flowchart showing a procedure of search processing according to the second embodiment;
FIG. 11 is a diagram showing an example of output contents of intermediate result classification means in the second embodiment;
FIG. 12 is a diagram showing an example of output contents of a result rearranging unit in the second embodiment;
FIG. 13 is a block diagram showing a configuration of a conventional document search apparatus;
FIG. 14 is a flowchart showing a conventional search processing procedure;
FIG. 15 is a diagram showing an example of a search request character string;
FIG. 16 is a diagram showing an example of output contents of a result sorting means.
[Explanation of symbols]
101, 901, 1301 Document data
102, 902, 1302 Dictionary
103, 903, 1303 Word frequency information extraction means
104, 904, 1304 Word frequency index
105, 905, 1305 Bibliographic item extraction means
106, 906, 1306 Bibliographic item index
107, 907, 1307 Search request input means
108, 908, 1308 Word frequency calculation means
109, 909, 1309 Logical operation means
110, 910, 1310 Fitness calculation means
111, 911 Relative fitness calculation means
112, 912, 1311 Sort information acquisition means
113 Intermediate result cut-off means
114, 914, 1312 Result sorting means
115, 915, 1313 Result display means
913 Intermediate result classification means

Claims

A document search apparatus that searches stored document data according to a search condition and displays the search results sorted according to a sort condition, the apparatus comprising: search request input means for inputting a search request character string consisting of the search condition, the sort condition, and a range specification of a fitness indicating the degree of matching with the search condition; search means for searching for documents that satisfy the search condition; fitness calculation means for calculating the fitness of each document retrieved by the search means; sort information acquisition means for acquiring sort information for rearranging the retrieved documents according to the sort condition; search result classification means for comparing the fitness of each retrieved document with the fitness range given in the range specification and assigning to each document a classification flag indicating whether or not the document falls within that range; search result sorting means for rearranging the documents with classification flags output from the search result classification means, first by the classification flag, then by the sort information when the classification flag values are the same, and then in order of fitness when the sort information is also the same; and search result display means for displaying the search results rearranged by the search result sorting means.

The document search apparatus according to claim 1, wherein the search means searches for documents that match the search condition and calculates the appearance frequency of each search word in each document, and the fitness calculation means calculates the fitness of each document based on the search word appearance frequencies calculated by the search means.

The document search apparatus according to claim 1, wherein the search means searches for documents that match the search condition and calculates the number of documents in which each search word appears and the appearance frequency of the search word in each document, and the fitness calculation means calculates the fitness of each document based on the appearance frequency of the search word in each document and the number of documents in which the search word appears.

The document search apparatus according to claim 1, wherein the fitness calculation means includes absolute fitness calculation means for calculating the fitness of each document, and relative fitness calculation means for converting the fitness of each document calculated by the absolute fitness calculation means into a value relative to the highest fitness among them; the fitness calculation means outputs the relative fitness expressed as that relative value as the fitness of each document; and the range specification input through the search request input means is made in terms of the relative fitness.

A document search method for searching stored document data according to a search condition and displaying the search results sorted according to a sort condition, the method comprising: in response to a search request specifying the search condition, the sort condition, and a range of fitness indicating the degree of matching with the search condition, searching the stored document data for documents that satisfy the search condition; calculating the fitness of each retrieved document; acquiring sort information of each document for rearranging the documents according to the sort condition; comparing the fitness of each document with the fitness range specified in the search request and assigning to each document a classification flag indicating whether or not the document falls within the range; and rearranging the documents first by the classification flag, then by the sort information when the classification flag values are the same, and then in order of fitness when the sort information is also the same, and displaying the result.

The document search method according to claim 5, wherein a relative fitness with respect to the highest fitness among the retrieved documents is calculated as the fitness of each retrieved document, and the range of fitness in the search request is specified in terms of the relative fitness.