JP3924899B2

JP3924899B2 - Text search apparatus and text search method

Info

Publication number: JP3924899B2
Application number: JP03874398A
Authority: JP
Inventors: 剛弘小山
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1998-02-20
Filing date: 1998-02-20
Publication date: 2007-06-06
Anticipated expiration: 2018-02-20
Also published as: JPH11238068A

Description

【０００１】
【発明の属する技術分野】
本発明は、文字列が不正確であり得るテキストを対象とするテキスト検索装置であって、特にその検索漏れの低減と検索結果における重み付けに関する。
【０００２】
【従来の技術】
従来より、検索文字列を指定し、文書や文字列に含まれる当該検索文字列を探索するテキスト検索装置があった。ワードプロセッサに搭載されている文字列検索機能は、そのようなものの一例である。
【０００３】
その検索対象である文書、文字列は、基本的には誤りがないことが前提とされる。そしてその検索においては、検索対象テキスト中に含まれる文字列が検索文字列と完全に一致した場合のみ、関心のある文字列が検索対象テキスト中に存在すると判断されていた。
【０００４】
これに対し、検索対象テキストが光学文字読取り装置（ＯＣＲ）で読み取られたテキストデータである場合には、その読み取りにおける認識誤りにより、不正確な文字列を含んだ不完全なテキストとなる確率が高い。日本語ＯＣＲは精度が低いため、特にそのおそれが高い。この不完全テキストに対し、上述のような検索文字列との完全一致による検索を行うと検索漏れが発生するおそれがある。つまり、検索対象テキストが正しく読み取られたものであるならばヒットしたはずである文字列部分が、認識誤りによりヒットしないことが起こりうる。
【０００５】
そのような検索漏れを防止するために、検索対象の曖昧さをある程度許容して検索を行う技術（以下、曖昧検索という。）が存在する。特開昭６２−４４８７８号公報に開示される第一の曖昧検索の従来技術は、認識の結果、複数の候補が得られた場合、検索対象テキスト中に候補文字を埋め込み、検索するものである（例．文［字学］認［識織］による［本木］文．．．）。特開平８−７０３３号公報に開示される第二の曖昧検索の従来技術は、文字認識を行った各文字について複数の候補が得られた場合にはインデックスにそれらを残すものである。この場合、認識結果を各文字ごとに格納したインデックスにおいて、認識対象の１文字に対して複数の認識結果の文字候補が格納されうる。この２つの技術は認識結果、すなわち検索対象テキストに曖昧さを持たせるものである。一方、特開平６−１９５３８７号公報、特開平７−１５２７７４号公報、特開平８−６３４８７号公報に開示される第三の曖昧検索の従来技術は、検索文字列の側に曖昧さを持たせるものである。この方法は、検索文字列中の誤って認識されやすい部分を、誤認識の可能性のある文字パターン（誤認識パターン）で置き換えた不完全検索文字列を作成し、正しい検索文字列だけでなく、不完全検索文字列によっても探索を行うものである。誤認識パターンのタイプとしては、文字誤り、誤分割、誤結合といったものがある。例えば、「字」は「学」と認識されやすいが、このようなタイプが文字誤りである。また、「化」は「イヒ」と認識されやすいが、このようなタイプが誤分割であり、一方、「５１」は「引」と認識されやすいが、このようなタイプが誤結合である。
【０００６】
曖昧検索を行うことにより、検索漏れの減少を図ることができるメリットがある一方、逆に本来、検索文字列とは異なる文字列が検索文字列と一致するとされる検索誤りが含まれる可能性もある。
【０００７】
【発明が解決しようとする課題】
上述の第一の曖昧検索の従来技術は、認識結果である検索対象テキストの容量が増加する、認識結果に残らないと検索されないといった問題があった。また、誤認識パターンのうち誤分割、誤結合に対応できないという問題もあった。
【０００８】
第二の曖昧検索の従来技術においても、認識結果であるインデックスの容量が増加する、認識結果に残らないと検索されないという問題があった。
【０００９】
第三の曖昧検索の従来技術は、検索文字列とは別に誤認識パターンを用意する必要があり、その容量が増加するという問題があった。また例えば誤結合は、連続する文字の組み合わせに依存して生じ、そのため多くのパターンが存在しうる。このように起こりうる誤認識パターンを全て予め用意することは困難である。そして予め用意されていない誤認識パターンが発生すると、検索漏れとなるという問題があった。検索に用いられる検索対象テキスト、インデックス、誤認識パターンなどの容量が増加することは、単に記憶装置に大きな容量を要するという問題だけでなく、検索処理に時間がかかるという問題も引き起こしていた。
【００１０】
また、検索漏れを少なくしようとして、インデックスに複数候補を登録したり誤認識パターンを充実させると、その一方で、検索文字列とは元来関係のない文字列まで、検索にてヒットするおそれがある。つまり、検索結果に「ゴミ」（検索誤り）が多く含まれることになって、検索結果の信頼性が低くなるという問題もあった。
【００１１】
本発明は上記問題点を解消するためになされたもので、曖昧検索に用いるためのデータを少なくする一方で、検索漏れを低減するとともに、検索誤りの影響を軽減するテキスト検索装置を提供することを目的とする。
【００１２】
【課題を解決するための手段】
本発明に係るテキスト検索装置は、検索対象テキストに対し、検索文字列に基づいて検索処理を行うテキスト検索装置において、検索条件として、検索文字列及び前記検索文字列とその置換文字列との間に指定された一致度の入力を受け付ける入力手段と、入力された前記検索文字列と前記一致度に応じて前記検索文字列の置換文字列を発生する置換文字列発生手段と、発生された前記置換文字列と一致する文字列パターンを含んだ候補文字列を前記検索対象テキスト中に探索する候補探索手段と、前記検索対象テキストの生成で生じうる誤り文字列パターンを登録した誤り文字列登録手段と、前記候補文字列中の前記検索文字列と異なる部分に、前記誤り文字列登録手段に登録された登録誤り文字列パターンを検出する誤り文字列検出手段と、前記検索文字列との一致可能性に応じた優先度を前記登録誤り文字列パターンの検出頻度に基づいて定め、前記候補文字列中における前記登録誤り文字列パターンの検出に対応して、前記優先度を前記候補文字列に対し付与するものである。
【００１３】
「一致度」は、検索文字列の置換文字列の当該検索文字列に対する一致度であり、例えば検索文字列の文字数と置換文字列のうち置換されずに残っている文字数との比に基づいて定義することができる。指定される一致度は、数値の範囲指定であってもよいし、閾値を示すものであってもよい。「置換文字列」は、検索文字列の一部の文字をマスキングしたものであり、それを構成する文字は元の検索文字列における位置の情報を保持している。例えば、検索文字列「キーワード」の置換文字列「キー＃＃ド」「キ＃＃ード」（＃はマスキングされた文字を表す。）は互いに同一の文字の組で構成されるが、マスキング位置が異なり、異なる置換文字列として扱われる。また、この例に示されるように、ある置換文字列を構成する文字は互いに連続する場合だけでなく、構成する文字の間にマスキング位置が配される場合もある。候補探索手段は、置換文字列を構成する各文字の位置に同一の文字が配置される文字列を候補文字列として、検索対象テキストから抽出する。つまり、候補文字列の抽出において、検索文字列のうちマスキングされた位置に来る文字の一致／不一致は問われない。
【００１５】
文字認識等により生成される検索対象テキストは、誤った文字列を含みうるが、その誤り文字列パターンはランダムではなく、元の正しい文字、又は文字列に対して発生しやすいパターンが存在し得る。誤り文字列登録部には、主としてそのような発生しやすい誤り文字列パターンが格納される。本発明は、候補文字列のうち検索文字列と異なる部分に、誤り文字列登録部に格納された誤りパターンを検知する。そして例えば、検知された誤りパターンが当該検索文字列中の対応部分の文字列に対するものである場合、誤りパターンの部分は検索対象テキスト生成前においては正しい文字列であった可能性が高いと判断して、一致可能性に応じた優先度を高く定めることができる。
【００１６】
本発明の好適な態様は、前記検索文字列とその置換文字列との間の指定された一致度に応じて前記検索文字列の置換文字列を発生する置換文字列発生手段を有し、前記候補文字列は、発生された前記置換文字列と一致する文字列パターンを含んだ文字列であるものである。
【００１７】
本発明に係るテキスト検索装置は、上記発明において前記候補探索手段が、前記発生した置換文字列に含まれる誤認識許容文字を、前記検索対象テキスト中の任意の１文字とみなす手段を有して前記探索を行うことや、前記検索対象テキスト中の任意の２文字とみなす手段を有して前記探索を行うことや、また、前記発生した置換文字列に含まれる誤認識許容文字のうち連続する２つを、前記検索対象テキスト中の任意の１文字とみなす手段を有して前記探索を行うことのいずれか、またはいくつかを備えたことを特徴とするものである。
【００１８】
これらにより、それぞれ文字誤り、誤分割、誤結合を誤りパターンとする候補文字列を検索することができる。
【００１９】
本発明に係るテキスト検索装置は、前記優先度付与手段が、前記登録誤り文字列パターンの検出頻度に応じて前記優先度を定めることを特徴とするものである。本発明によれば、例えば、検出頻度が高い登録誤り文字列パターンに対する元の文字列はそのような誤りを生じやすいと判断され、高い優先度を与えることができる。
【００２０】
本発明の好適な態様は、前記誤り文字列登録部が、前記登録誤り文字列パターンに加えてさらにその検出頻度を格納するものである。
【００２１】
本発明に係るテキスト検索装置は、前記優先度に応じて前記候補文字列を表示する候補文字列表示手段を有するものである。本発明によれば、ユーザは、優先度に基づいて、複数の候補文字列における検索文字列に一致する可能性を把握することができ、例えば、検索処理の結果をチェックする際に便利である。
また、本発明に係るテキスト検索方法は、コンピュータ上で検索対象テキストに対し、検索文字列に基づいて検索処理を行うテキスト検索方法において、前記コンピュータが有する入力手段が、検索条件として、検索文字列及び前記検索文字列とその置換文字列との間に指定された一致度の入力を受け付けるステップと、前記コンピュータが有する置換文字列発生手段が、入力された検索文字列と前記一致度に応じて前記検索文字列の置換文字列を発生するステップと、前記コンピュータが有する候補探索手段が、発生された前記置換文字列と一致する文字列パターンを含んだ候補文字列を前記検索対象テキスト中に探索するステップと、前記コンピュータが有する文字列登録手段が、前記検索対象テキストの生成で生じうる誤りパターンを登録するステップと、前記コンピュータが有する誤り文字列検出手段が、前記候補文字列中の前記検索文字列と異なる部分に、前記文字列登録手段に登録された登録誤り文字列パターンを検出するステップと、前記コンピュータが有する優先度付与手段が、前記検索文字列との一致可能性に応じた優先度を前記登録誤り文字列パターンの検出頻度に基づいて定め、前記候補文字列中における前記登録誤り文字列パターンの検出に対応して、前記優先度を前記候補文字列に対し付与するステップであること、を特徴とする。
【００２２】
【発明の実施の形態】
次に、本発明の実施形態について図面を参照して説明する。
【００２３】
［実施形態１］
図１は、本発明の実施形態であるテキスト検索装置の概略のブロック構成図である。本装置は、ＯＣＲによって文字認識されたテキストを検索の対象とし、インデックス記憶部２、入力部４、検索部６、対象文字位置情報記憶部８、検索文字列展開部１０、マッチング部１２、出力部１４を含んで構成される。
【００２４】
インデックス記憶部２は、ＯＣＲにより得られた検索対象テキストをインデックスの形式で、検索に先立って格納している。インデックスは、検索対象テキストに出現する文字をキーとして、それに当該文字の出現位置を対応付けたものである。
【００２５】
入力部４は、検索文字列や一致度といった検索条件をユーザから受け付ける。
【００２６】
検索部６は、入力部４から検索文字列を得て、それに含まれる各文字にてインデックス記憶部２に記憶されたインデックスを検索して、検索文字列の各文字の出現位置を対象文字位置情報記憶部８へ出力し、対象文字位置情報記憶部８はこれを格納する。
【００２７】
検索文字列展開部１０は、検索文字列の部分列を発生する部分列発生手段であり、入力部４から検索文字列と一致度を得て、その一致度に応じて、検索文字列からその部分列を含んだ置換文字列を展開・生成する。置換文字列は、検索文字列の一部の文字を例えば記号「＃」で置換して、元の文字をマスキングすることにより生成される。置換文字列のうち「＃」で置換された部分以外は、検索文字列の元の文字で構成された部分列である。
【００２８】
マッチング部１２は、検索文字列展開部１０から出力される置換文字列を用いて、対象文字位置情報記憶部８に格納された対象文字位置情報とのマッチングを行う。そのマッチング結果は出力部１４へ出力され、ＣＲＴ等の表示装置に検索結果として画面表示される。
【００２９】
次に、具体的な例を用いて、各構成部の動作を説明する。図２は、検索対象テキストのイメージを示す模式図である。ここで例に用いる検索対象テキストは文書Ａ、文書Ｂ、文書Ｃの３つである。文書Ａにはその先頭から１０文字目から文字列「ペルシャ」が存在する。同様に文書Ｂにはその先頭から５文字目から文字列「ベルシャ」が存在し、文書Ｃにはその先頭から２１文字目から文字列「ペノレシャ」が存在する。
【００３０】
図３は、これらの検索対象テキストに対して生成され、インデックス記憶部２に格納されているインデックスのイメージを示す模式図である。インデックスは、検索対象テキストに出現する文字の種類（図中、左端に示す。）をキーとして、当該文字種が現れる文書中の位置をキーごとに分類したものである。その文字の出現位置は、図中、文書Ａ〜Ｃを区別する番号Ｎdoc（文書Ａは“１”、文書Ｂは“２”、文書Ｃは“３”）と、各文書の先頭からの文字数Ｎcharとの組（Ｎdoc，Ｎchar）の形式で表されている。
【００３１】
ユーザは検索条件として、検索文字列「ペルシャ」、一致度７０％を、入力部４に対し入力する。ここでは、一致度ηは、検索文字列に対する部分列の文字数の比で定義される。つまり検索文字列の文字数をＭ、置換文字列のうち置換されずに残っている文字数をｍとすると、一致度η＝ｍ／Ｍ×１００［％］となる。入力部４に入力される一致度は、ηの閾値ηthであり、本装置は検索対象テキスト中にηthを超える一致度を有する文字列を探索する。なお、一致度の閾値ηthが低いと検索結果に含まれる「ゴミ」が増えるため、閾値ηthの好適な値は、一般に７０％程度若しくはそれを上回る値である。一方、閾値ηthが必要以上に高いと検索漏れを生じる可能性が高くなる。その点も考慮して、ここではηth＝７０％に設定した。
【００３２】
入力部４は、検索文字列「ペルシャ」を検索部６へ通知する。検索部６はこの検索文字列を得ると、それを構成する各文字「ペ」、「ル」、「シ」、「ャ」をキーとしてインデックス記憶部２を検索し、その結果を対象文字位置情報記憶部８に格納する。具体的には、この例では文字「ペ」に対する出現位置（1,10）、（3,21）、文字「ル」に対する出現位置（1,11）、（2,6）、文字「シ」に対する出現位置（1,12）、（2,7）、（3,24）、文字「ャ」に対する出現位置（1,13）、（2,8）、（3,25）が対象文字位置情報記憶部８に格納される。図４は、対象文字位置情報記憶部８に格納される対象文字位置情報のイメージを示す模式図である。
【００３３】
検索文字列展開部１０は、入力部４から検索文字列と一致度の閾値ηthを受け取って、検索文字列のうち、一致度に応じた数の文字を誤認識許容文字で置換した置換文字列を生成する。誤認識許容文字（曖昧文字）を、ここでは記号「＃」にて表わす。誤認識許容文字が置かれた部分の検索文字列の文字はマスキングされる。マスキングとは、後述する置換文字列と検索対象テキストとのマッチングにおいて、両者の異同を問わないことを意味する。
【００３４】
ここでは検索文字列の文字数Ｍ＝４であるので、一致度の閾値ηth＝７０％を満たす部分列の文字数ｍは３または４である。よって、検索文字列展開部１０は、誤認識許容文字を全く含まない置換文字列（これは検索文字列に等しい。）と誤認識許容文字を１つだけ含む置換文字列を生成する。具体的には、この例では置換文字列として、「ペルシャ」、「＃ルシャ」、「ペ＃シャ」、「ペル＃ャ」、「ペルシ＃」の５つが生成される。
【００３５】
マッチング部１２は、この置換文字列と検索対象テキストとのマッチングを行い、検索文字列に一致する可能性を有する候補文字列を探索する候補探索手段である。マッチングは、置換文字列中での相対的な文字位置と、対象文字位置情報記憶部８に格納された出現位置を照合することにより行われる。以下、α、βを置換文字列に現れる通常の文字とする。
【００３６】
例えば、置換文字列中に現れる２文字の部分「αβ」のマッチングは以下のように行われる。まずマッチング部１２はα、βをキーとして対象文字位置情報記憶部８を検索する。ここでα、βに対応する出現位置をそれぞれ（Ｎdoc(α)，Ｎchar(α)）、（Ｎdoc(β)，Ｎchar(β)）とする。マッチング部１２は、文書番号に関してＮdoc(α)＝Ｎdoc(β)であり、かつ文字位置に関してＮchar(β)＝Ｎchar(α)＋１なる出現位置を見出すことにより、連続する２文字「αβ」の存在を検知する。
【００３７】
次に、誤認識許容文字「＃」を含んだ文字列部分「α＃β」、「α＃＃β」、「α＃＃＃β」、「α＃＃＃…β」等に対するマッチング処理は以下のように行われる。誤認識許容文字「＃」に関する基本的なマッチング規則は以下の３通りである。
【００３８】
(i) 「＃」は任意の１文字と同一とみなされる、
(ii) 「＃」は任意の２文字と同一とみなされる、
(iii)「＃＃」は任意の１文字と同一とみなされる。
【００３９】
(i)は文字誤りに対応した規則である。また(ii)、(iii)はそれぞれ誤分割、誤結合に対応した規則である。
【００４０】
規則(i)は、Ｎdoc(α)＝Ｎdoc(β)かつＮchar(β)＝Ｎchar(α)＋２なる出現位置の探索として実現される。規則(ii)は、Ｎdoc(α)＝Ｎdoc(β)かつＮchar(β)＝Ｎchar(α)＋３なる出現位置の探索として実現される。また規則(iii)は、Ｎdoc(α)＝Ｎdoc(β)かつＮchar(β)＝Ｎchar(α)＋２なる出現位置の探索により実現される。これらの探索により、マッチング部１２は「α＃β」等の文字列パターンの存在を検知する。
【００４１】
マッチング部１２は、置換文字列の各部分について上述のマッチング処理を行って、検索対象テキスト中における置換文字列の存在を検知する。例えば、置換文字列「ペルシャ」に対しては「ペ（1,10）」、「ル（1,11）」、「シ（1,12）」、「ャ（1,13）」が上述の基本的なマッチング規則に適合し、マッチング部１２はマッチング結果として、当該置換文字列とその先頭文字の出現位置との組「ペルシャ（1,10）」を出力する。そして１度マッチしたものは別の置換文字列でマッチしないように対象から除いていく。また、置換文字列「＃ルシャ」に対しては基本規則と規則(i)に基づいて「ル（2,6）」に先行する任意の１文字と「ル（2,6）」、「シ（2,7）」、「ャ（2,8）」が検知され、マッチング部１２はマッチング結果として「＃ルシャ（2,5）」を出力する。また、置換文字列「ペ＃シャ」に対しては基本規則と規則(ii)に基づいて「ペ（3,21）」、これに続く任意の２文字、この任意の２文字に後続する「シ（3,24）」、「ャ（3,25）」が上述のマッチング規則に適合し、マッチング部１２はマッチング結果として、「ペ＃シャ（3,21）」を出力する。なお、置換文字列「ペル＃ャ」、「ペルシ＃」に対しても探索は行われるが、この例ではそれらにヒットする文字列（候補文字列）は存在しない。
【００４２】
出力部１４は、マッチング部１２で得られたマッチング結果に基づいて、画面上に検索結果を表示する。上述のマッチング部１２は、マッチング結果として候補文字列の位置を出力するものであり、出力部１４はそれを例えば、「文書Ａ：（1,10）文書Ｂ：（2,5）文書Ｃ：（3,21）」と表示することができる。その他、誤認識許容文字数によって、完全一致、１文字曖昧、２文字曖昧というようにランキングを行い、それらのグループごとに区分して表示してもよい。
【００４３】
また、出力部１４は、マッチング部１２から得た文書番号と文字位置を基に、検索対象テキストにアクセスして、候補文字列を得てそれを表示してもよい。また、マッチング部１２自体が、候補文字列の位置情報に基づいてインデックス記憶部２にアクセスし、候補文字列をその位置情報と併せて出力部１４へ出力するように構成することもできる。このような構成により、出力部１４は、候補文字列を含んだ内容、例えば「文書Ａ：ペルシャ（1,10）文書Ｂ：ベルシャ（2,5）文書Ｃ：ペノレシャ（3,21）」を表示することができる。
【００４４】
ちなみに本装置は、汎用コンピュータを用いて構成することができ、特に検索部６、検索文字列展開部１０、マッチング部１２の機能は、中央演算処理部（ＣＰＵ：Central Processing Unit）により実行されうる。
【００４５】
本実施形態にて説明した本発明によれば、例えば文字認識において誤って認識されることにより、ある文字又は文字列がどのような誤った文字又は文字列に変換されて検索対象テキストが生成されるかという情報を用いないのにも拘わらず、文字誤り、誤分割、誤結合に対応することができ、検索文字列に一致する可能性のある候補文字列をもれなく検索することができる。
【００４６】
［実施形態２］
図５は、本発明の第２の実施形態であるテキスト検索装置の概略のブロック構成図である。本装置の構成要素のうち上記実施形態と同様のものについては同一の符号を付し説明を簡単にする。本装置は、上記装置の構成に加えて、テキスト記憶部２０、誤り文字列登録部２４、ランキング部２６とをさらに備えた点が主たる相違点である。
【００４７】
テキスト記憶部２０は、検索対象テキストを格納しており、各文書は文書番号を付され互いに区別されうる。
【００４８】
マッチング部２２は、上記実施形態の装置と同様の処理を行って、候補文字列の位置情報を得る。本装置のマッチング部２２は、さらにその位置情報に基づいて、テキスト記憶部２０にアクセスし、候補文字列を取得し出力する。このとき、位置情報も併せて出力することができる。
【００４９】
誤り文字列登録部２４は、文字認識において誤認識されやすい文字又は文字列である誤り文字列パターンを格納している。
【００５０】
ランキング部２６は、マッチング部２２から候補文字列を得ると、当該候補文字列中に誤り文字列登録部２４に登録された誤り文字列パターンを探索する。そして、ランキング部２６はその結果に応じて候補文字列と検索文字列との一致可能性に応じた優先度を定める（優先度付与手段）。ランキング部２６は、候補文字列とその優先度とを出力部１４へ出力する。
【００５１】
次に、具体的な例を用いて、本装置の動作の特徴を説明する。図６は、検索対象テキストのイメージを示す模式図である。ここで例に用いる検索対象テキストは文書Ａ、文書Ｂ、文書Ｃの３つである。文書Ａにはその先頭から１０文字目から文字列「スキャナ」が存在する。同様に文書Ｂにはその先頭から５文字目から文字列「スキャン」が存在し、文書Ｃにはその先頭から２１文字目から文字列「スキヤナ」が存在する。
【００５２】
図７は、これらの検索対象テキストに対して生成され、インデックス記憶部２に格納されているインデックスのイメージを示す模式図である。
【００５３】
ユーザは検索条件として、検索文字列「スキャナ」、一致度７０％を、入力部４に対し入力する。
【００５４】
入力部４は、検索文字列「スキャナ」を検索部６へ通知する。検索部６はこの検索文字列を得ると、上記実施形態と同様、それを構成する各文字をキーとしてインデックス記憶部２を検索し、その結果を対象文字位置情報記憶部８に格納する。
【００５５】
検索文字列展開部１０は、入力部４から検索文字列と一致度の閾値ηthを受け取って、それに応じた置換文字列を生成する。
【００５６】
ここでは検索文字列の文字数Ｍ＝４、及び一致度の閾値ηth＝７０％に基づいて、検索文字列展開部１０は、誤認識許容文字を全く含まない置換文字列と誤認識許容文字を１つだけ含む置換文字列を生成する。具体的には、この例では置換文字列として、「スキャナ」、「＃キャナ」、「ス＃ャナ」、「スキ＃ナ」、「スキャ＃」の５つが生成される。
【００５７】
マッチング部２２は、この置換文字列と検索対象テキストとのマッチングを行い、検索対象テキスト中における置換文字列の存在を検知する。例えば、置換文字列「スキャナ」に対しては「ス（1,10）」、「キ（1,11）」、「ャ（1,12）」、「ナ（1,13）」がマッチする。マッチング部２２は、この位置情報に基づいて、テキスト記憶部２０に格納された検索対象テキストから候補文字列「スキャナ」を取得し、これとその先頭文字の出現位置との組「スキャナ（1,10）」を、ランキング部２６へ出力する。また、置換文字列「スキャ＃」に対しては上記実施形態で述べた基本規則と規則(i)に基づいて「ス（2,5）」、「キ（2,6）」、「ャ（2,7）」及びこれに後続する任意の１文字がマッチする。マッチング部２２は、この位置情報を基にテキスト記憶部２０にアクセスして候補文字列「スキャン」を取得し、マッチング結果として「スキャン（2,5）」を出力する。また、置換文字列「スキ＃ナ」に対しては基本規則と規則(i)に基づいて「ス（3,21）」、「キ（3,22）」、これに続く任意の１文字、この任意の１文字に後続する「ナ（3,24）」がマッチする。マッチング部２２はこの位置情報を基にテキスト記憶部２０にアクセスして候補文字列「スキヤナ」を取得し、マッチング結果として「スキヤナ（3,21）」を出力する。
【００５８】
ランキング部２６は、マッチング部２２からマッチング結果を得ると、候補文字列のランキングを行う。ここでランキングは、候補文字列が検索文字列に一致する可能性に応じた優先度を定める処理であり、候補文字列が検索文字列と異なる部分（誤り文字列）の文字数と誤り文字列登録部２４に誤り文字列パターンとして登録されているかどうかに基づいて定められる。
【００５９】
図８は、誤り文字列登録部２４に登録された誤り文字列パターンの一例を示す模式図である。図は、検索対象テキストを生成する際の文字認識において、「→」の左側の文字又は文字列が、右側の文字又は文字列と誤って認識されやすいことを示している。例えば、「ス」は「イ」に、「ャ」は「ヤ」や「ゃ」に、「ナ」は「メ」に、「ル」は「ノレ」に誤って認識されやすいことを示している。
【００６０】
ランキング部２６は、例えば、候補文字列が検索文字列と完全一致の場合には、優先度を表す数値としてポイント「１００」を付与し、１文字不一致の場合にはポイント「１０」を付与する。その上でランキング部２６は、候補文字列と検索文字列との差分である誤り文字列が、誤り文字列登録部２４に誤り文字列パターンとして登録されているかどうかを調べ、もし登録されている場合は、既に獲得しているポイントに、例えば「４０」ポイントを加える。
【００６１】
よって、例えば、候補文字列「スキャナ」は完全一致であるので、ポイント「１００」を獲得し、候補文字列「スキヤナ」は１文字不一致で、さらに誤り文字列登録部２４に「ャ→ヤ」が登録されているので、それぞれのポイント「１０」、「４０」を加算したポイント「５０」を得る。一方、候補文字列「スキャン」は１文字不一致であるが、誤り文字列登録部２４にその誤り文字列が登録されていないので、ポイント「１０」のみを得る。そして、ランキング部２６は、例えば、ランキング結果として、候補文字列とその位置情報とポイントの組、例えば「スキャナ（1,10,100）」、「スキヤナ（3,21,50）」、「スキャン（2,5,10）」を出力部１４へ出力する。
【００６２】
出力部１４は、ランキング部２６からのランキング結果を得ると、それに含まれるポイントを用いた表示を行うことができる。例えば、ポイントが高い、すなわち検索文字列と一致する可能性が高い順に、候補文字列を画面表示するといったことができる。また、出力部１４は、ある値以上のポイントを得た候補文字列のみを表示してもよいし、ポイントが指定された範囲内にあるものをグループ化して表示してもよい。
【００６３】
上述のランキング部２６は、誤り文字列登録部２４に登録された誤り文字列パターンに対して一定のポイントを付与したが、必ずしも付与されるポイントは一律でなくてもよい。例えば、誤り文字列登録部２４に各誤り文字列パターンの検出頻度などで表される誤りやすさの度合いを格納し、これをランキングに反映させることにより、より詳細なランキングを行うことができる。例えば、誤りやすさを０〜１の調整係数で設定し、ポイントは、誤り文字列パターン共通のポイントに誤りやすさの調整係数を乗じるといった方法がある。このような方法では、例えば、上述の例において候補文字列「スキヤナ」の調整係数を０．８とすれば、そのポイントは１０＋４０×０．８＝４２となるわけである。また、ユーザが検索結果に基づいて、誤り文字列パターンの検出頻度を増減するように構成することができる。
【００６４】
本実施形態にて説明した本発明によれば、第一の実施形態で説明した発明と同様、検索処理のうちマッチング自体は、誤り文字列登録部２４に登録された誤り文字列パターンを必要とせずに、文字誤り、誤分割、誤結合に対応することができ、検索文字列に一致する可能性のある候補文字列をもれなく検索することができる。このもれなく検索することにより、検索文字列との一致可能性が低いものも候補文字列として検出され、マッチング結果に含まれる「ゴミ」（検索誤り）の割合が増加することは否めない。本発明は、もれなく検索するとともに、その検索結果をより確からしい順番にて表示することを可能にし、これによりユーザが検索結果を利用する際に各候補文字列の重要度（優先度）を把握することが可能となり、検索誤りが生じても実際の利用におけるその影響を軽減することができる。従来の検索文字列を誤り文字列パターンを用いて展開して検索を行う方法では、誤り文字列パターンを登録した辞書がある程度充実していないと検索もれが多くなり、信頼性が低くなる。これに対し本発明では、誤り文字列登録部２４のデータが無い場合でも、もれなく検索でき、誤り文字列登録部２４のデータを充実させていくことによりランキングの精度を向上させていくことができる。
【００６５】
［実施形態３］
本発明の第３の実施形態は、他の検索文字列を用いた他の検索処理例に係るものであり、本実施形態に係るテキスト検索装置の構成は、上記第二の実施形態の装置と同様である。
【００６６】
図９は、検索対象テキストのイメージを示す模式図である。ここで例に用いる検索対象テキストは文書Ａ、文書Ｂ、文書Ｃの３つである。文書Ａにはその先頭から１０文字目から文字列「アルタリア」が存在する。同様に文書Ｂにはその先頭から５文字目から文字列「アル列ア」が存在し、文書Ｃにはその先頭から２１文字目から文字列「アル夕リア」（“夕”は漢字）が存在する。
【００６７】
図１０は、これらの検索対象テキストに対して生成され、インデックス記憶部２に格納されているインデックスのイメージを示す模式図である。
【００６８】
ユーザは検索条件として、検索文字列「アルタリア」及び、一致度６０％を入力部４に対し入力する。
【００６９】
入力部４は、検索文字列「アルタリア」を検索部６へ通知する。検索部６はこの検索文字列を得ると、上記実施形態と同様、それを構成する各文字をキーとしてインデックス記憶部２を検索し、その結果を対象文字位置情報記憶部８に格納する。
【００７０】
検索文字列展開部１０は、入力部４から検索文字列と一致度の閾値ηthを受け取って、それに応じた置換文字列を生成する。
【００７１】
ここでは検索文字列の文字数Ｍ＝５、及び一致度の閾値ηth＝６０％から誤認識許容文字は２文字許される。検索文字列展開部１０は具体的には、この例では置換文字列として、「アルタリア」、「＃＃タリア」、「＃ル＃リア」、「＃ルタ＃ア」、「＃ルタリ＃」、「ア＃＃リア」、「ア＃タ＃ア」、「ア＃タリ＃」、「アル＃＃ア」、「アル＃リ＃」、「アルタ＃＃」を生成しマッチング部２２へ出力する。
【００７２】
マッチング部２２は、この置換文字列と検索対象テキストとのマッチングを行い、検索対象テキスト中における置換文字列の存在を検知する。例えば、置換文字列「アルタリア」に対しては「ア（1,10）」、「ル（1,11）」、「タ（1,12）」、「リ（1,13）」、「ア（1,14）」がマッチする。マッチング部２２は、この位置情報に基づいて、テキスト記憶部２０に格納された検索対象テキストから候補文字列「アルタリア」を取得し、これとその先頭文字の出現位置との組「アルタリア（1,10）」を、ランキング部２６へ出力する。また、置換文字列「＃ル＃リア」に対しては任意の１文字、これに続く「ル（3,22）」、これに続く任意の１文字、「リ（3,24）」、「ア（3,25）」がマッチする。マッチング部２２は、この位置情報を基にテキスト記憶部２０にアクセスして候補文字列「アル夕リア」（“夕”は漢字）を取得し、マッチング結果として「アル夕リア（3,21）」を出力する。また、置換文字列「アル＃＃ア」に対しては上記第一の実施形態で述べた規則(iii)から「ア（2,5）」、「ル（2,6）」、これに続く１文字、及び「ア（2,8）」がマッチする。マッチング部２２はこの位置情報を基にテキスト記憶部２０にアクセスして候補文字列「アル列ア」を取得し、マッチング結果として「アル列ア（2,5）」を出力する。
【００７３】
ランキング部２６は、マッチング部２２からマッチング結果を得ると、候補文字列のランキングを行う。本装置では、ランキング部２６が付与するポイントは、誤認識許容文字が２つの場合に拡張され、その場合に生じ得るそれぞれのケースについて定められている。例えば、以下のように定めることができる。
【００７４】

【００７５】
また、誤り文字列登録部２４に登録された誤り文字列パターンには、「タリ→列」、「タ（カタカナ）→夕（漢字）」が含まれているものとする。
【００７６】
ランキング部２６は、候補文字列「アルタリア」に対しては完全一致の場合のポイント「１００」を付与し、「アル列ア」は２文字不一致かつ誤り文字列「タリ→列」が誤り文字列登録部２４に登録されているので、１０＋６０＝７０ポイントを付与される。また、候補文字列「アル夕リア」（“夕”は漢字）は１文字不一致かつ誤り文字列「タ→夕」が誤り文字列登録部２４に登録されているので、５０＋３０＝８０ポイントを付与される。そして、ランキング部２６は、例えば、ランキング結果として、候補文字列とその位置情報とポイントの組、例えば「アルタリア（1,10,100）」、「アル列ア（2,5,70）」、「アル夕リア（3,21,80）」を出力部１４へ出力する。
【００７７】
出力部１４は、ランキング部２６からのランキング結果を得ると、例えば、ポイントが高い順に、候補文字列を画面表示する。また、出力部１４は、ある値以上のポイントを得た候補文字列のみを表示してもよいし、ポイントが指定された範囲内にあるものをグループ化して表示してもよい。
【００７８】
なお、誤り文字列登録部２４を用いたランキングではなく、簡単に、完全一致、１文字曖昧、２文字曖昧というランキングを行うことも可能である。
【図面の簡単な説明】
【図１】本発明の第一の実施形態であるテキスト検索装置の概略のブロック構成図である。
【図２】第一の実施形態に係る検索対象テキストのイメージを示す模式図である。
【図３】第一の実施形態に係る検索対象テキストのインデックスのイメージを示す模式図である。
【図４】対象文字位置情報記憶部に格納される対象文字位置情報のイメージを示す模式図である。
【図５】本発明の第二の実施形態であるテキスト検索装置の概略のブロック構成図である。
【図６】第二の実施形態に係る検索対象テキストのイメージを示す模式図である。
【図７】第二の実施形態に係る検索対象テキストのインデックスのイメージを示す模式図である。
【図８】誤り文字列登録部に登録された誤り文字列パターンの一例を示す模式図である。
【図９】第三の実施形態に係る検索対象テキストのイメージを示す模式図である。
【図１０】検索対象テキストに対して生成され、インデックス記憶部に格納されているインデックスのイメージを示す模式図である。
【符号の説明】
２インデックス記憶部、４入力部、６検索部、８対象文字位置情報記憶部、１０検索文字列展開部、１２，２２マッチング部、１４出力部、２０テキスト記憶部、２４誤り文字列登録部、２６ランキング部。
[0001]
[Technical field of the invention]
The present invention relates to a text search apparatus for text that may contain inaccurate character strings, and in particular to reducing search omissions and weighting search results.
[0002]
[Prior art]
Conventionally, there have been text search devices that take a specified search character string and look for it in a document or character string. The string search function built into a word processor is one example.
[0003]
It is assumed that the documents and character strings to be searched are basically free from errors. In the search, it is determined that the character string of interest exists in the search target text only when the character string included in the search target text completely matches the search character string.
[0004]
On the other hand, when the search target text is text data read by an optical character reader (OCR), there is a high probability that recognition errors during reading leave the text incomplete, containing inaccurate character strings. Japanese OCR has low accuracy, so the risk is particularly high. If this incomplete text is searched by complete match with the search character string as described above, search omissions may occur. In other words, a character string portion that would have been hit if the search target text had been read correctly may fail to be hit because of a recognition error.
[0005]
To prevent such search omissions, there are techniques that perform a search while tolerating a certain degree of ambiguity (hereinafter referred to as fuzzy search). The first prior-art fuzzy search technique, disclosed in Japanese Patent Laid-Open No. 62-44878, embeds candidate characters in the search target text when recognition yields multiple candidates, and searches that text (e.g., 文［字学］認［識織］による［本木］文..., where each bracket lists the candidates for one character position). The second prior-art fuzzy search technique, disclosed in Japanese Patent Laid-Open No. 8-7033, keeps the candidates in the index when multiple candidates are obtained for a recognized character. In this case, the index, which stores the recognition result character by character, can hold several candidate characters for a single character to be recognized. These two techniques give ambiguity to the recognition result, that is, to the search target text. By contrast, the third prior-art fuzzy search technique, disclosed in Japanese Patent Laid-Open Nos. 6-195387, 7-152774, and 8-63487, gives ambiguity to the search character string. This method creates incomplete search character strings by replacing the parts of the search character string that are easily misrecognized with character patterns that may result from misrecognition (misrecognition patterns), and searches with these incomplete search character strings as well as with the correct one. The types of misrecognition pattern include character error, misdivision, and miscombination. For example, “字” is easily recognized as “学”; this type is a character error. “化” is easily recognized as “イヒ”; this type is misdivision. Conversely, “51” is easily recognized as “引”; this type is miscombination.
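The third prior-art approach (expanding the query with misrecognition patterns) can be sketched as follows. This is only an illustration under stated assumptions: the table `CONFUSIONS` and the function `expand_query` are hypothetical names, and the three entries are the original Japanese examples from this paragraph (字 misread as 学, 化 misread as イヒ, 51 misread as 引).

```python
# Hypothetical confusion table: correct text -> forms that OCR tends to produce.
CONFUSIONS = {
    "字": ["学"],    # character error
    "化": ["イヒ"],  # misdivision: one character read as two
    "51": ["引"],    # miscombination: two characters read as one
}


def expand_query(query, confusions=CONFUSIONS):
    """Return the query plus every variant with one confusable span replaced."""
    variants = {query}
    for correct, misreads in confusions.items():
        start = query.find(correct)
        while start != -1:
            for wrong in misreads:
                variants.add(query[:start] + wrong + query[start + len(correct):])
            start = query.find(correct, start + 1)
    return variants
```

Every pattern has to be present in the table beforehand, which is the storage and coverage problem this document raises against the approach.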
[0006]
While fuzzy search has the advantage of reducing search omissions, it can conversely introduce search errors, in which a character string that actually differs from the search character string is judged to match it.
[0007]
[Problems to be solved by the invention]
The first prior-art fuzzy search technique described above has the problems that the search target text, which is the recognition result, grows in size, and that a string cannot be found unless the correct character remains in the recognition result. It also cannot handle the misdivision and miscombination types of misrecognition pattern.
[0008]
The second prior-art fuzzy search technique likewise has the problems that the index, which is the recognition result, grows in size, and that a string cannot be found unless the correct character remains in the recognition result.
[0009]
The third prior-art fuzzy search technique requires misrecognition patterns to be prepared separately from the search character string, and the size of that data grows. Moreover, miscombination, for example, depends on the combination of consecutive characters, so many patterns can exist. It is difficult to prepare in advance every misrecognition pattern that can occur in this way, and when a pattern that has not been prepared occurs, a search omission results. Growth in the size of the search target text, index, misrecognition patterns, and other data used for search causes not only the problem that the storage device needs a large capacity but also the problem that search processing takes a long time.
[0010]
Furthermore, if multiple candidates are registered in the index or the misrecognition patterns are enriched in an attempt to reduce search omissions, character strings that are unrelated to the search character string may also be hit. That is, the search result contains a lot of “garbage” (search errors), and the reliability of the search result is lowered.
[0011]
The present invention has been made to solve the above problems, and its object is to provide a text search device that reduces the data needed for fuzzy search while reducing search omissions and mitigating the influence of search errors.
[0012]
[Means for Solving the Problems]
The text search device according to the present invention is a text search device that performs search processing on a search target text based on a search character string, and comprises: input means for accepting, as search conditions, a search character string and a matching degree specified between the search character string and its replacement character strings; replacement character string generating means for generating replacement character strings of the search character string according to the input search character string and the matching degree; candidate search means for searching the search target text for candidate character strings that include a character string pattern matching a generated replacement character string; error character string registration means in which error character string patterns that may arise in the generation of the search target text are registered; and error character string detection means for detecting, in the portion of a candidate character string that differs from the search character string, a registered error character string pattern registered in the error character string registration means. A priority reflecting the possibility of matching the search character string is determined based on the detection frequency of the registered error character string pattern, and that priority is assigned to the candidate character string in response to detection of the registered error character string pattern in it.
[0013]
The “matching degree” is the degree to which a replacement character string of the search character string matches that search character string. It can be defined, for example, based on the ratio between the number of characters in the search character string and the number of characters in the replacement character string that remain unreplaced. The specified matching degree may be a numerical range or a threshold value. A “replacement character string” is obtained by masking some characters of the search character string, and the characters that make it up retain their position information in the original search character string. For example, the replacement character strings “キー##ド” and “キ##ード” (# represents a masked character) of the search character string “キーワード” (keyword) consist of the same set of characters, but their masking positions differ, so they are treated as different replacement character strings. As this example shows, the characters that make up a replacement character string are not necessarily contiguous; masking positions may lie between them. The candidate search means extracts from the search target text, as a candidate character string, any character string in which the same character is found at the position of each character that makes up the replacement character string. That is, when candidate character strings are extracted, it does not matter whether the characters at the masked positions match those of the search character string.
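The position-preserving behaviour of masks can be illustrated with a minimal sketch. It assumes the simplest case, in which each mask stands for exactly one character of the target text; the function name `matches_mask` and the ASCII `#` placeholder are choices made for this illustration, not part of the patent.

```python
MASK = "#"  # placeholder for a masked character (an assumption of this sketch)


def matches_mask(mask, text):
    """True if text has the mask's length and agrees at every unmasked position."""
    if len(mask) != len(text):
        return False
    return all(m == MASK or m == t for m, t in zip(mask, text))
```

With the two replacement strings of “キーワード”, the string “キーボッド” satisfies “キー##ド” but not “キ##ード”, because the latter still requires “ー” in the fourth position. This is why the two masks are treated as different replacement character strings even though they keep the same set of characters.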
[0015]
Search target text generated by character recognition or the like may contain erroneous character strings, but the error character string patterns are not random: for a given correct character or character string, some patterns are more likely to occur than others. The error character string registration unit mainly stores such frequently occurring error character string patterns. The present invention detects an error pattern stored in the error character string registration unit in the portion of a candidate character string that differs from the search character string. Then, for example, if the detected error pattern is one registered for the corresponding portion of the search character string, the error pattern portion is judged likely to have been the correct character string before the search target text was generated, and the priority reflecting the possibility of matching can be set high.
[0016]
A preferred aspect of the present invention has replacement character string generating means for generating replacement character strings of the search character string according to the matching degree specified between the search character string and its replacement character strings, and the candidate character string is a character string that includes a character string pattern matching a generated replacement character string.
[0017]
In the text search device according to the present invention, the candidate search means in the above invention performs the search with one or more of the following: means for treating a misrecognition-tolerant character (mask) included in a generated replacement character string as any one character in the search target text; means for treating such a character as any two characters in the search target text; and means for treating two consecutive misrecognition-tolerant characters included in a generated replacement character string as any one character in the search target text.
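The three ways of interpreting a mask can be approximated with a regular expression. The sketch below is an assumption-laden illustration, not the patent's implementation (which works on an index of character positions): a run of k consecutive masks is allowed to cover between ceil(k/2) and 2k characters, which generalizes the three stated cases, and `mask_to_regex` is a hypothetical helper name.

```python
import re


def mask_to_regex(mask):
    """Compile a replacement string into a regex; '#' marks a masked character."""
    parts = []
    for piece in re.split(r"(#+)", mask):
        if not piece:
            continue
        if piece[0] == "#":
            k = len(piece)
            # One mask: 1 or 2 characters. Two masks may also collapse to 1.
            parts.append(".{%d,%d}" % ((k + 1) // 2, 2 * k))
        else:
            parts.append(re.escape(piece))
    return re.compile("".join(parts))
```

For instance, the mask “変#” accepts both “変化” and the misdivided “変イヒ”, and “##番” accepts “51番” as well as the miscombined “引番”. A wide range per mask run also admits more unrelated strings, which is the trade-off that the priority mechanism described in this document is meant to offset.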
[0018]
As a result, candidate character strings whose error patterns are character errors, misdivision, and miscombination, respectively, can be found.
[0019]
The text search device according to the present invention is characterized in that the priority assigning means determines the priority according to the detection frequency of the registered error character string pattern. According to the present invention, for example, the original character string of a registered error character string pattern with a high detection frequency is judged to be prone to that error, and a high priority can be given.
[0020]
In a preferred aspect of the present invention, the error character string registration unit stores, in addition to the registered error character string patterns, their detection frequencies.
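A minimal sketch of frequency-weighted priority follows. The point values (100 for an exact match, 10 for a one-character mismatch, a bonus of 40 scaled by a coefficient between 0 and 1) and the example strings are the illustrative figures from the second embodiment in the Japanese description ([0060] to [0063]). The registry layout and the names `ERROR_PATTERNS` and `score` are assumptions of this sketch, and only same-length, single-character differences are handled.

```python
EXACT_POINTS = 100        # candidate equals the search string
ONE_MISMATCH_POINTS = 10  # candidate differs in exactly one character
REGISTERED_BONUS = 40     # added when the difference is a registered pattern

# Hypothetical registry: (correct, misread) -> adjustment coefficient in [0, 1].
ERROR_PATTERNS = {("ャ", "ヤ"): 0.8, ("ス", "イ"): 1.0, ("ナ", "メ"): 1.0}


def score(query, candidate, patterns=ERROR_PATTERNS):
    """Priority of a candidate; handles one same-length substitution only."""
    if candidate == query:
        return EXACT_POINTS
    diffs = [(q, c) for q, c in zip(query, candidate) if q != c]
    if len(query) != len(candidate) or len(diffs) != 1:
        raise ValueError("sketch only handles one same-length substitution")
    return ONE_MISMATCH_POINTS + REGISTERED_BONUS * patterns.get(diffs[0], 0.0)


candidates = ["スキャン", "スキヤナ", "スキャナ"]
ranked = sorted(candidates, key=lambda c: score("スキャナ", c), reverse=True)
```

Under these values “スキヤナ” scores 10 + 40 × 0.8 = 42 for the query “スキャナ”, while the unregistered difference in “スキャン” leaves it at 10, so `ranked` lists the exact match first and “スキャン” last.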
[0021]
The text search device according to the present invention includes candidate character string display means for displaying the candidate character strings according to their priority. According to the present invention, the user can judge from the priority how likely each of several candidate character strings is to match the search character string, which is convenient, for example, when checking the result of search processing.
The text search method according to the present invention is a text search method for performing, on a computer, search processing on a search target text based on a search character string, and is characterized by comprising: a step in which input means of the computer accepts, as search conditions, a search character string and a matching degree specified between the search character string and its replacement character strings; a step in which replacement character string generating means of the computer generates replacement character strings of the search character string according to the input search character string and the matching degree; a step in which candidate search means of the computer searches the search target text for candidate character strings that include a character string pattern matching a generated replacement character string; a step in which character string registration means of the computer registers error patterns that may arise in the generation of the search target text; a step in which error character string detection means of the computer detects, in the portion of a candidate character string that differs from the search character string, a registered error character string pattern registered in the character string registration means; and a step in which priority assigning means of the computer determines a priority reflecting the possibility of matching the search character string based on the detection frequency of the registered error character string pattern and, in response to detection of the registered error character string pattern in the candidate character string, assigns that priority to the candidate character string.
[0022]
DETAILED DESCRIPTION OF THE INVENTION
Next, embodiments of the present invention will be described with reference to the drawings.
[0023]
[Embodiment 1]
FIG. 1 is a schematic block diagram of a text search apparatus according to an embodiment of the present invention. This apparatus searches text recognized by OCR and comprises an index storage unit 2, an input unit 4, a search unit 6, a target character position information storage unit 8, a search character string expansion unit 10, a matching unit 12, and an output unit 14.
[0024]
The index storage unit 2 stores, before any search takes place, the search target text obtained by OCR in the form of an index. The index uses each character that appears in the search target text as a key and associates with it the positions at which that character appears.
[0025]
The input unit 4 receives a search condition such as a search character string and a matching degree from the user.
[0026]
The search unit 6 obtains the search character string from the input unit 4, looks up each of its characters in the index stored in the index storage unit 2, and outputs the appearance positions of each character of the search character string to the target character position information storage unit 8, which stores them.
[0027]
The search character string expansion unit 10 is substring generating means that generates substrings of the search character string. It obtains the search character string and the matching degree from the input unit 4 and, according to the matching degree, expands the search character string into replacement character strings that contain its substrings. A replacement character string is generated by replacing some characters of the search character string with, for example, the symbol “#”, thereby masking the original characters. The part of a replacement character string other than the characters replaced with “#” is a substring made up of the original characters of the search character string.
[0028]
The matching unit 12 performs matching with the target character position information stored in the target character position information storage unit 8 using the replacement character string output from the search character string expansion unit 10. The matching result is output to the output unit 14 and displayed on the screen as a search result on a display device such as a CRT.
[0029]
Next, the operation of each component is described using a specific example. FIG. 2 is a schematic diagram showing the search target text. The search target texts in this example are document A, document B, and document C. Document A contains the character string “Persia” starting at the 10th character from the beginning. Similarly, document B contains the character string “Bersha” (the first character is misrecognized) starting at the 5th character, and document C contains the character string “Penoresha” (the second character is split into two characters) starting at the 21st character.
[0030]
FIG. 3 is a schematic diagram showing the index generated for these search target texts and stored in the index storage unit 2. The index uses each kind of character appearing in the search target text (shown at the left end of the figure) as a key and lists, for each key, the positions in the documents at which that character appears. An appearance position is expressed as a pair (Ndoc, Nchar) of the document number Ndoc (“1” for document A, “2” for document B, “3” for document C) and the character count Nchar from the beginning of that document.
[0031]
As a search condition, the user inputs the search character string “Persia” and a matching degree of 70% to the input unit 4. Here, the matching degree η is defined as the ratio of the number of characters in the substring to the number of characters in the search character string. That is, if the number of characters in the search character string is M and the number of characters in the replacement character string that remain unreplaced is m, then η = m / M × 100 [%]. The matching degree input to the input unit 4 is a threshold value ηth for η, and this apparatus searches the search target text for character strings whose matching degree is at least ηth. If ηth is low, the amount of “noise” (irrelevant hits) in the search result increases, so ηth is generally preferably about 70% or more. Conversely, if ηth is higher than necessary, search omissions become likely. With these points in mind, ηth = 70% is used here.
[0032]
The input unit 4 passes the search character string “Persia” to the search unit 6. On receiving it, the search unit 6 searches the index storage unit 2 using each of the characters “pe”, “le”, “si”, and “ja” as a key and stores the results in the target character position information storage unit 8. Specifically, in this example, the appearance positions (1,10) and (3,21) for the character “pe”, (1,11) and (2,6) for the character “le”, (1,12), (2,7), and (3,24) for the character “si”, and (1,13), (2,8), and (3,25) for the character “ja” are stored in the storage unit 8. FIG. 4 is a schematic diagram showing the target character position information stored in the target character position information storage unit 8.
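The index and the per-character lookup described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function names `build_index` and `lookup` are hypothetical, and the single letters P, L, S, J (and B, N, R) are stand-ins assumed here for the individual characters of the example strings, with “.” as filler text.

```python
from collections import defaultdict

def build_index(documents):
    """Map each character to the (Ndoc, Nchar) pairs where it appears (1-based)."""
    index = defaultdict(list)
    for ndoc, text in enumerate(documents, start=1):
        for nchar, ch in enumerate(text, start=1):
            index[ch].append((ndoc, nchar))
    return dict(index)

def lookup(index, query):
    """Collect the appearance positions of every character of the query."""
    return {ch: index.get(ch, []) for ch in query}

# Stand-in documents A, B, C: the strings start at characters 10, 5, and 21.
documents = ["." * 9 + "PLSJ", "." * 4 + "BLSJ", "." * 20 + "PNRSJ"]
index = build_index(documents)
positions = lookup(index, "PLSJ")
print(positions["P"])  # [(1, 10), (3, 21)]
print(positions["L"])  # [(1, 11), (2, 6)]
```

The dictionary returned by `lookup` plays the role of the target character position information: only the characters of the search character string are carried forward to matching.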
[0033]
The search character string expansion unit 10 receives the search character string and the matching degree threshold ηth from the input unit 4 and generates replacement character strings in which as many characters of the search character string as the matching degree permits are replaced with misrecognition allowable characters. A misrecognition allowable character (an ambiguous character) is represented here by the symbol “#”. The characters of the search character string at the positions where misrecognition allowable characters are placed are masked. Masking means that a difference at that position is ignored when the replacement character string is matched against the search target text, as described later.
[0034]
Here, since the number of characters M in the search character string is 4, the number of characters m in a substring that satisfies the matching degree threshold ηth = 70% is 3 or 4. The search character string expansion unit 10 therefore generates a replacement character string containing no misrecognition allowable character (which is equal to the search character string) and replacement character strings containing exactly one misrecognition allowable character. Specifically, in this example, five replacement character strings are generated: “pe le si ja” (the search character string “Persia” itself), “# le si ja”, “pe # si ja”, “pe le # ja”, and “pe le si #”.
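The expansion step can be sketched as below. The function name `replacement_strings` is hypothetical, the letters P, L, S, J are stand-ins assumed for the four characters of the search character string, and the sketch assumes that a matching degree equal to the threshold is acceptable.

```python
import math
from itertools import combinations

def replacement_strings(query, threshold_pct, mask="#"):
    """Mask up to M - m_min characters, where m_min / M * 100 >= threshold_pct."""
    total = len(query)                                 # M
    min_keep = math.ceil(total * threshold_pct / 100)  # smallest allowed m
    max_masks = total - min_keep
    result = []
    for k in range(max_masks + 1):
        for masked in combinations(range(total), k):
            chars = list(query)
            for pos in masked:
                chars[pos] = mask
            result.append("".join(chars))
    return result

strings = replacement_strings("PLSJ", 70)
print(strings)  # ['PLSJ', '#LSJ', 'P#SJ', 'PL#J', 'PLS#']
```

With M = 4 and a 70% threshold, at most one character may be masked, which yields the five strings of the example.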
[0035]
The matching unit 12 is candidate search means that matches the replacement character strings against the search target text and searches for candidate character strings that may match the search character string. Matching is performed by collating the relative character positions in a replacement character string with the appearance positions stored in the target character position information storage unit 8. Hereinafter, α and β denote ordinary (unmasked) characters appearing in a replacement character string.
[0036]
For example, the two-character portion “αβ” appearing in a replacement character string is matched as follows. First, the matching unit 12 searches the target character position information storage unit 8 using α and β as keys. Let the appearance positions corresponding to α and β be (Ndoc(α), Nchar(α)) and (Ndoc(β), Nchar(β)), respectively. The matching unit 12 looks for appearance positions that satisfy Ndoc(α) = Ndoc(β) for the document number and Nchar(β) = Nchar(α) + 1 for the character position, and thereby detects the presence of “αβ”.
[0037]
Next, character string portions that contain misrecognition allowable characters, such as “α#β”, “α##β”, “α####β”, and longer runs of “#”, are matched as follows. The following three matching rules apply to the misrecognition allowable character “#”.
[0038]
(i) “#” is regarded as the same as any single character,
(ii) “#” is regarded as the same as any two characters,
(iii) “##” is regarded as the same as any single character.
[0039]
Rule (i) corresponds to a character error. Rules (ii) and (iii) correspond to misdivision and miscombination, respectively.
[0040]
Rule (i) is realized, for the pattern “α#β”, as a search for appearance positions where Ndoc(α) = Ndoc(β) and Nchar(β) = Nchar(α) + 2. Rule (ii) is realized, for the same pattern, as a search for appearance positions where Ndoc(α) = Ndoc(β) and Nchar(β) = Nchar(α) + 3. Rule (iii) is realized, for the pattern “α##β”, as a search for appearance positions where Ndoc(α) = Ndoc(β) and Nchar(β) = Nchar(α) + 2. Through these searches, the matching unit 12 detects the presence of character string patterns such as “α#β”.
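The position arithmetic behind these rules can be sketched as follows. The function name `find_pairs` and the `OFFSETS` table are hypothetical. The offsets for zero and one “#” and the offset 2 for “##” follow the rules above; the offset 3 for “##” (rule (i) applied to each “#” separately) is an assumption of this sketch.

```python
# Allowed values of Nchar(beta) - Nchar(alpha), keyed by the number of "#"
# between alpha and beta.
OFFSETS = {
    0: (1,),    # "αβ": adjacent characters
    1: (2, 3),  # "α#β": rule (i) gives +2, rule (ii) gives +3
    2: (2, 3),  # "α##β": rule (iii) gives +2; +3 is assumed (rule (i) twice)
}

def find_pairs(pos_alpha, pos_beta, n_masks):
    """Return the alpha positions that have a beta at an allowed offset."""
    beta = set(pos_beta)
    hits = []
    for ndoc, nchar in pos_alpha:
        if any((ndoc, nchar + off) in beta for off in OFFSETS[n_masks]):
            hits.append((ndoc, nchar))
    return hits

# Appearance positions of the first and third characters in the first example.
pos_pe = [(1, 10), (3, 21)]
pos_si = [(1, 12), (2, 7), (3, 24)]
print(find_pairs(pos_pe, pos_si, 1))  # [(1, 10), (3, 21)]
```

Here (1,10) is found through rule (i) and (3,21) through rule (ii). A full matcher would chain this check across every part of the replacement character string and skip positions already matched by an earlier replacement character string.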
[0041]
The matching unit 12 performs the above matching process on each part of a replacement character string and detects the presence of the replacement character string in the search target text. For example, for the replacement character string “pe le si ja” (“Persia”), “pe (1,10)”, “le (1,11)”, “si (1,12)”, and “ja (1,13)” satisfy the basic matching rule, so the matching unit 12 outputs the pair of the replacement character string and the appearance position of its first character, “Persia (1,10)”, as a matching result. A position that has matched once is removed from the targets so that it does not match another replacement character string. For the replacement character string “# le si ja”, an arbitrary single character followed by “le (2,6)”, “si (2,7)”, and “ja (2,8)” is detected, and the matching unit 12 outputs “# le si ja (2,5)” as a matching result. For the replacement character string “pe # si ja”, “pe (3,21)”, the two arbitrary characters following it, “si (3,24)”, and “ja (3,25)” satisfy the above matching rules, and the matching unit 12 outputs “pe # si ja (3,21)” as a matching result. The search is also performed for the replacement character strings “pe le # ja” and “pe le si #”, but in this example no character string (candidate character string) hits them.
[0042]
The output unit 14 displays the search result on the screen based on the matching result obtained by the matching unit 12. The matching unit 12 described above outputs the position of each candidate character string as a matching result, so the output unit 14 can display, for example, “Document A: (1,10) Document B: (2,5) Document C: (3,21)”. In addition, the results may be ranked according to the number of misrecognition allowable characters, for example as perfect match, one-character ambiguity, and two-character ambiguity, and displayed by group.
[0043]
Further, the output unit 14 may access the search target text based on the document number and character position obtained from the matching unit 12 to obtain and display the candidate character string. Alternatively, the matching unit 12 itself may be configured to access the index storage unit 2 based on the position information of the candidate character string and output the candidate character string together with its position information to the output unit 14. With such a configuration, the output unit 14 can display content that includes the candidate character strings, for example “Document A: Persia (1,10) Document B: Bersha (2,5) Document C: Penoresha (3,21)”.
[0044]
This apparatus can be configured using a general-purpose computer. In particular, the functions of the search unit 6, the search character string expansion unit 10, and the matching unit 12 can be executed by a central processing unit (CPU).
[0045]
According to the invention described in this embodiment, the search does not depend on whether information is available about which character or character string was converted into which erroneous character or character string when the search target text was generated. Even without such information, it can handle character errors, misdivisions, and miscombinations, and search for all candidate character strings that may match the search character string.
[0046]
[Embodiment 2]
FIG. 5 is a schematic block diagram of a text search apparatus according to the second embodiment of the present invention. Of the components of this apparatus, the same components as those in the above embodiment are given the same reference numerals to simplify the description. The main difference of this apparatus is that it further includes a text storage unit 20, an error character string registration unit 24, and a ranking unit 26 in addition to the configuration of the above apparatus.
[0047]
The text storage unit 20 stores search target text, and each document is assigned a document number and can be distinguished from each other.
[0048]
The matching unit 22 performs the same processing as that of the apparatus of the above embodiment, and obtains position information of the candidate character string. The matching unit 22 of this apparatus further accesses the text storage unit 20 based on the position information, and acquires and outputs a candidate character string. At this time, position information can also be output.
[0049]
The error character string registration unit 24 stores an error character string pattern that is a character or a character string that is easily misrecognized in character recognition.
[0050]
When the ranking unit 26 obtains a candidate character string from the matching unit 22, the ranking unit 26 searches for an error character string pattern registered in the error character string registration unit 24 in the candidate character string. Then, the ranking unit 26 determines priority according to the possibility of matching between the candidate character string and the search character string according to the result (priority giving means). The ranking unit 26 outputs the candidate character string and its priority to the output unit 14.
[0051]
Next, the features of the operation of this apparatus are described using a specific example. FIG. 6 is a schematic diagram showing the search target text. The search target texts in this example are document A, document B, and document C. Document A contains the character string “scanner” starting at the 10th character from the beginning. Similarly, document B contains the character string “scan”, which differs from “scanner” in its fourth character, starting at the 5th character, and document C contains a variant of “scanner” that differs in its third character, starting at the 21st character.
[0052]
FIG. 7 is a schematic diagram showing an image of an index generated for these search target texts and stored in the index storage unit 2.
[0053]
As a search condition, the user inputs a search character string “scanner” and a matching degree of 70% to the input unit 4.
[0054]
The input unit 4 passes the search character string “scanner” to the search unit 6. On receiving it, the search unit 6 searches the index storage unit 2 using each of its constituent characters as a key and stores the results in the target character position information storage unit 8, as in the above embodiment.
[0055]
The search character string expansion unit 10 receives the search character string and the matching degree threshold ηth from the input unit 4 and generates replacement character strings corresponding to the threshold ηth.
[0056]
Here, based on the number of characters M = 4 in the search character string and the matching degree threshold ηth = 70%, the search character string expansion unit 10 generates a replacement character string containing no misrecognition allowable character and replacement character strings containing exactly one. Specifically, in this example, five replacement character strings are generated: “su ki ja na” (the search character string “scanner” itself), “# ki ja na”, “su # ja na”, “su ki # na”, and “su ki ja #”.
[0057]
The matching unit 22 matches the replacement character strings against the search target text and detects the presence of each replacement character string in the search target text. For example, “su (1,10)”, “ki (1,11)”, “ja (1,12)”, and “na (1,13)” match the replacement character string “su ki ja na” (“scanner”). Based on this position information, the matching unit 22 acquires the candidate character string “scanner” from the search target text stored in the text storage unit 20 and outputs the pair of the candidate character string and its position information, “scanner (1,10)”, to the ranking unit 26. For the replacement character string “su ki ja #”, based on the basic rule and rule (i) described in the above embodiment, “su (2,5)”, “ki (2,6)”, “ja (2,7)”, and the arbitrary single character that follows them match. Based on this position information, the matching unit 22 accesses the text storage unit 20, acquires the candidate character string “scan”, and outputs “scan (2,5)” as a matching result. For the replacement character string “su ki # na”, based on the basic rule and rule (i), “su (3,21)”, “ki (3,22)”, the arbitrary single character that follows them, and “na (3,24)” following that character match. Based on this position information, the matching unit 22 accesses the text storage unit 20, acquires the candidate character string (the variant of “scanner” in document C), and outputs it with the position (3,21) as a matching result.
[0058]
When the ranking unit 26 obtains a matching result from the matching unit 22, it ranks the candidate character strings. Ranking is a process that determines a priority according to the possibility that a candidate character string matches the search character string. The priority is determined based on the number of characters in the part where the candidate character string differs from the search character string (the error character string) and on whether that part is registered in the error character string registration unit 24 as an error character string pattern.
[0059]
FIG. 8 is a schematic diagram showing examples of error character string patterns registered in the error character string registration unit 24. In the figure, the character or character string on the left of “→” is easily mistaken, in the character recognition that generates the search target text, for the character or character string on the right. For example, “S” is easily recognized as “I”, “A” as “Ya” or “N”, “N” as “Me”, and “Le” as “Nore”.
[0060]
For example, the ranking unit 26 assigns 100 points, as a numerical value indicating priority, when the candidate character string completely matches the search character string, and 10 points when it does not. The ranking unit 26 then checks whether the error character string, that is, the difference between the candidate character string and the search character string, is registered in the error character string registration unit 24 as an error character string pattern. If it is registered, for example, 40 points are added to the points already obtained.
[0061]
Therefore, for example, the candidate character string “scanner” in document A is a perfect match and obtains 100 points. The candidate character string in document C, a variant of “scanner”, differs in one character, and the corresponding error “a → ya” is registered in the error character string registration unit 24, so it obtains 10 + 40 = 50 points. On the other hand, the candidate character string “scan” in document B also differs in one character, but its error character string is not registered in the error character string registration unit 24, so it obtains only 10 points. The ranking unit 26 then outputs to the output unit 14, as a ranking result, sets consisting of each candidate character string, its position information, and its points, for example “scanner (1,10,100)”, the variant of “scanner” with (3,21,50), and “scan (2,5,10)”.
[0062]
When the output unit 14 obtains the ranking result from the ranking unit 26, the output unit 14 can perform display using the points included therein. For example, the candidate character strings can be displayed on the screen in the order from the highest point, that is, the highest possibility of matching with the search character string. Further, the output unit 14 may display only candidate character strings that have obtained a point equal to or greater than a certain value, or may display grouped items that have points within a specified range.
[0063]
The ranking unit 26 described above gives fixed points for an error character string pattern registered in the error character string registration unit 24, but the points need not be uniform. For example, the error character string registration unit 24 can store, for each error character string pattern, how easily that error occurs, expressed by its detection frequency, and reflecting this in the ranking allows finer ranking. One method expresses this likelihood as an adjustment coefficient between 0 and 1 and multiplies the points common to all error character string patterns by that coefficient. With this method, if the adjustment coefficient for the error in the candidate character string “scanner” of the above example is 0.8, its points are 10 + 40 × 0.8 = 42. Further, the user can increase or decrease the detection frequency of an error character string pattern based on the search results.
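The point scheme of this embodiment can be sketched as a small function. The name `score` and its parameters are hypothetical; the values 100, 10, and 40 are the example values given in the text, and the coefficient defaults to 1 so that the uniform case and the weighted case share one code path.

```python
def score(n_mismatched, registered, coefficient=1.0):
    """Points for one candidate character string.

    n_mismatched: number of characters that differ from the search string
    registered:   True if the differing part is a registered error pattern
    coefficient:  adjustment coefficient in [0, 1] for that error pattern
    """
    if n_mismatched == 0:
        return 100                   # perfect match
    points = 10                      # base points for a non-matching candidate
    if registered:
        points += 40 * coefficient   # bonus for a known recognition error
    return points

print(score(0, False))      # perfect match
print(score(1, True))       # registered error, uniform bonus
print(score(1, False))      # unregistered error
print(score(1, True, 0.8))  # registered error with adjustment coefficient 0.8
```

The four calls correspond to the cases worked through above: 100, 10 + 40 = 50, 10, and 10 + 40 × 0.8 = 42 points.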
[0064]
According to the invention described in this embodiment, as in the invention described in the first embodiment, the matching itself in the search process does not require the error character string patterns registered in the error character string registration unit 24. It can therefore handle character errors, misdivision, and miscombination, and search for all candidate character strings that may match the search character string. Admittedly, searching exhaustively in this way also detects candidate character strings with a low possibility of matching the search character string, so the proportion of “noise” (search errors) in the matching result increases. However, the present invention searches without omission and displays the search results in order of likelihood, so the user can grasp the importance (priority) of each candidate character string when using the search results, and the practical impact of any search errors is reduced. In the conventional method of expanding the search character string using error character string patterns, search omissions increase and reliability falls unless the dictionary of registered error character string patterns is sufficiently comprehensive. In the present invention, by contrast, the search is performed without omission even if the error character string registration unit 24 contains no data, and enriching the data in the error character string registration unit 24 improves the accuracy of the ranking.
[0065]
[Embodiment 3]
The third embodiment of the present invention concerns another search processing example using a different search character string. The configuration of the text search apparatus according to this embodiment is the same as that of the second embodiment described above.
[0066]
FIG. 9 is a schematic diagram showing the search target text. The search target texts in this example are document A, document B, and document C. Document A contains the character string “Altalia” starting at the 10th character from the beginning. Similarly, document B contains the character string “Al string A” starting at the 5th character, and document C contains the character string “Al Yuria” (“Yu” is a kanji) starting at the 21st character.
[0067]
FIG. 10 is a schematic diagram showing an index image generated for these search target texts and stored in the index storage unit 2.
[0068]
As a search condition, the user inputs a search character string “Altalia” and a matching degree of 60% to the input unit 4.
[0069]
The input unit 4 passes the search character string “Altalia” to the search unit 6. On receiving it, the search unit 6 searches the index storage unit 2 using each of its constituent characters as a key and stores the results in the target character position information storage unit 8, as in the above embodiment.
[0070]
The search character string expansion unit 10 receives the search character string and the matching degree threshold ηth from the input unit 4 and generates replacement character strings corresponding to the threshold ηth.
[0071]
Here, from the number of characters M = 5 in the search character string and the matching degree threshold ηth = 60%, up to two misrecognition allowable characters are permitted. Specifically, in this example, the search character string expansion unit 10 generates, as replacement character strings, “a le ta li a” (the search character string “Altalia” itself), “## ta li a”, “# le # li a”, “# le ta # a”, “# le ta li #”, “a ## li a”, “a # ta # a”, “a # ta li #”, “a le ## a”, “a le # li #”, and “a le ta ##”, and outputs them to the matching unit 22.
[0072]
The matching unit 22 matches the replacement character strings against the search target text and detects the presence of each replacement character string in the search target text. For example, for the replacement character string “a le ta li a” (“Altalia”), “a (1,10)”, “le (1,11)”, “ta (1,12)”, “li (1,13)”, and “a (1,14)” match. Based on this position information, the matching unit 22 acquires the candidate character string “Altalia” from the search target text stored in the text storage unit 20 and outputs the pair of the candidate character string and its position information, “Altalia (1,10)”, to the ranking unit 26. For the replacement character string “# le # li a”, an arbitrary single character, “le (3,22)” following it, an arbitrary single character following that, “li (3,24)”, and “a (3,25)” match. Based on this position information, the matching unit 22 accesses the text storage unit 20, acquires the candidate character string “Al Yuria” (“Yu” is a kanji), and outputs “Al Yuria (3,21)” as a matching result. Further, for the replacement character string “a le ## a”, in accordance with rule (iii) described in the first embodiment, “a (2,5)”, “le (2,6)”, the arbitrary single character following them, and “a (2,8)” match. Based on this position information, the matching unit 22 accesses the text storage unit 20, acquires the candidate character string “Al string A”, and outputs “Al string A (2,5)” as a matching result.
[0073]
When the ranking unit 26 obtains a matching result from the matching unit 22, it ranks the candidate character strings. In this apparatus, the points given by the ranking unit 26 are extended to the case of two misrecognition allowable characters, with points defined for each case that can arise. For example, they can be defined as follows.
[0074]

[0075]
Further, it is assumed that the error character string patterns registered in the error character string registration unit 24 include “Tari → string” and “Ta (katakana) → Yu (kanji)”.
[0076]
The ranking unit 26 gives the candidate character string “Altalia” the 100 points for a perfect match. “Al string A” differs in two characters, and the error character string “Tari → string” is registered in the error character string registration unit 24, so it is given 10 + 60 = 70 points. The candidate character string “Al Yuria” (“Yu” is a kanji) differs in one character, and the error character string “Ta → Yu” is registered in the error character string registration unit 24, so it is given 50 + 30 = 80 points. The ranking unit 26 then outputs to the output unit 14, as a ranking result, sets consisting of each candidate character string, its position information, and its points, for example “Altalia (1,10,100)”, “Al string A (2,5,70)”, and “Al Yuria (3,21,80)”.
[0077]
When the output unit 14 obtains the ranking result from the ranking unit 26, for example, the output unit 14 displays candidate character strings on the screen in descending order of points. Further, the output unit 14 may display only candidate character strings that have obtained a point equal to or greater than a certain value, or may display grouped items that have points within a specified range.
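The display options just described (descending order of points, a minimum-points cutoff, and grouping by point range) can be sketched as follows. The function names and the range labels are hypothetical; the three result tuples are the ranking results of this example.

```python
# (candidate character string, (Ndoc, Nchar), points)
results = [
    ("Altalia", (1, 10), 100),
    ("Al string A", (2, 5), 70),
    ("Al Yuria", (3, 21), 80),
]

def by_points(items):
    """Order candidates from the highest points to the lowest."""
    return sorted(items, key=lambda item: item[2], reverse=True)

def at_least(items, minimum):
    """Keep only candidates whose points reach the given minimum."""
    return [item for item in by_points(items) if item[2] >= minimum]

def grouped(items, ranges):
    """Group candidates by inclusive (label, low, high) point ranges."""
    return {
        label: [item for item in by_points(items) if low <= item[2] <= high]
        for label, low, high in ranges
    }

ordered = [name for name, _, _ in by_points(results)]
print(ordered)  # ['Altalia', 'Al Yuria', 'Al string A']
print(at_least(results, 80))
print(grouped(results, [("high", 80, 100), ("low", 0, 79)]))
```

Sorting once in `by_points` keeps the three display modes consistent with each other, since the cutoff and the grouping both reuse the same order.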
[0078]
Instead of ranking using the error character string registration unit 24, it is also possible to simply perform ranking of perfect match, one character ambiguity, and two character ambiguity.
[Brief description of the drawings]
FIG. 1 is a schematic block diagram of a text search apparatus according to a first embodiment of the present invention.
FIG. 2 is a schematic diagram showing an image of a search target text according to the first embodiment.
FIG. 3 is a schematic diagram showing an image of a search target text index according to the first embodiment.
FIG. 4 is a schematic diagram showing an image of target character position information stored in a target character position information storage unit.
FIG. 5 is a schematic block diagram of a text search apparatus according to a second embodiment of the present invention.
FIG. 6 is a schematic diagram showing an image of a search target text according to the second embodiment.
FIG. 7 is a schematic diagram showing an image of a search target text index according to the second embodiment.
FIG. 8 is a schematic diagram illustrating an example of an error character string pattern registered in an error character string registration unit.
FIG. 9 is a schematic diagram showing an image of a search target text according to a third embodiment.
FIG. 10 is a schematic diagram showing an image of an index generated for a search target text and stored in an index storage unit.
[Explanation of symbols]
2 index storage unit, 4 input unit, 6 search unit, 8 target character position information storage unit, 10 search character string expansion unit, 12, 22 matching unit, 14 output unit, 20 text storage unit, 24 error character string registration unit, 26 ranking unit.

Claims

In a text search device that performs a search process based on a search character string for a search target text,
Input means for receiving, as search conditions, input of a search character string and of a matching degree specified between the search character string and its replacement character string;
A replacement character string generating means for generating a replacement character string of the search character string in accordance with the input search character string and the degree of matching;
Candidate search means for searching the search target text for a candidate character string including a character string pattern that matches the generated replacement character string;
An error character string registration means for registering an error character string pattern that may occur in the generation of the search target text;
An error character string detection means for detecting a registered error character string pattern registered in the error character string registration means in a portion different from the search character string in the candidate character string;
Priority giving means for determining, based on the detection frequency of the registered error character string pattern, a priority according to the possibility of matching with the search character string, and for giving the priority to the candidate character string in response to detection of the registered error character string pattern in the candidate character string;
A text search apparatus comprising:

The text search apparatus according to claim 1, wherein the matching degree is expressed based on a ratio between the number of characters in the search character string and the number of characters in the replacement character string that remain unreplaced.

The text search apparatus according to claim 1, wherein the replacement character string includes a misrecognition allowable character, and the candidate search means includes means for regarding the misrecognition allowable character included in the generated replacement character string as one arbitrary character in the search target text when performing the search.

The text search apparatus according to claim 1, wherein the replacement character string includes a misrecognition allowable character, and the candidate search means includes means for regarding the misrecognition allowable character included in the generated replacement character string as two arbitrary characters in the search target text when performing the search.

The text search apparatus according to claim 1, wherein the replacement character string includes misrecognition allowable characters, and the candidate search means includes means for regarding two consecutive misrecognition allowable characters included in the generated replacement character string as one arbitrary character in the search target text when performing the search.

The error character string registration means further stores the detection frequency of the registered error character string pattern, and the priority assigning means determines the priority according to the detection frequency stored in the error character string registration means. The text search apparatus according to claim 1.

4. The text search apparatus according to claim 3, further comprising candidate character string display means for displaying the candidate character string in accordance with the priority.

In a text search method for performing search processing on a search target text on a computer based on a search character string,
A step in which input means of the computer accepts, as search conditions, input of a search character string and of a matching degree specified between the search character string and its replacement character string;
A step in which replacement character string generating means of the computer generates a replacement character string of the search character string in accordance with the input search character string and the matching degree;
A step in which candidate search means of the computer searches the search target text for a candidate character string including a character string pattern that matches the generated replacement character string;
A step in which error character string registration means of the computer registers an error character string pattern that may occur in the generation of the search target text;
A step in which error character string detection means of the computer detects a registered error character string pattern, registered in the error character string registration means, in a portion of the candidate character string that differs from the search character string;
A step in which priority giving means of the computer determines, based on the detection frequency of the registered error character string pattern, a priority according to the possibility of matching with the search character string, and gives the priority to the candidate character string in response to detection of the registered error character string pattern in the candidate character string;
A text search method characterized by comprising: