JP6172565B2

JP6172565B2 - Document processing apparatus and program

Info

Publication number: JP6172565B2
Application number: JP2013122768A
Authority: JP
Inventors: 清水　淳一; 淳一清水; 洋実北; 勝也小柳; 真太郎安達; 徹也脇山; 紘幸岸本
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2013-06-11
Filing date: 2013-06-11
Publication date: 2017-08-02
Anticipated expiration: 2033-06-11
Also published as: JP2014241027A

Description

The present invention relates to a document processing apparatus and a program.

Patent Document 1 discloses an image search method comprising: an expansion step of expanding a search character string into a corresponding character string image; a filtering step of applying predetermined filtering to the character string image expanded in the expansion step; a recognition step of segmenting the filtered character string image into independent parts, performing character recognition on each segmented part, and obtaining recognized character string candidates; a generation step of generating other combinable recognized character string candidates based on differences among the recognized character string candidates obtained in the recognition step; and a character string search step of searching for a predetermined character string under a logical OR condition of the combinable candidates generated in the generation step and the candidates obtained in the recognition step.

Patent Document 2 discloses an information processing method and apparatus comprising an image scanner that inputs image information, OCR software that recognizes the input image, a text information storage unit that stores the recognition result, and document search software that performs a search using patterns created by assuming that extra characters have been inserted into a specified search word or by thinning out characters from the search word.

Patent Document 1: JP H10-069494 A
Patent Document 2: JP H09-016619 A

An object of the present invention is to provide a document processing apparatus and a program capable of generating rules with which documents can be sorted with high accuracy.

The present invention according to claim 1 is a document processing apparatus comprising: receiving means for receiving image information of a document; character information extracting means for extracting character information including a character string from the image information of the document received by the receiving means; sorting means for sorting the document received by the receiving means based on the character information extracted by the character information extracting means; and sorting rule generating means for generating a sorting rule for the sorting means so as to adjust the redundancy of character string recognition by the character information extracting means, wherein a target value of the recognition rate of the character string to be recognized by the character information extracting means is set in advance, and the sorting rule generating means generates a sorting rule in which the redundancy of character string recognition is adjusted so that the recognition rate is equal to or higher than the target value, performs processing to replace at least one character constituting the character string extracted by the character information extracting means with a character that matches all target characters, and, when the recognition rate of the extracted character string is predicted not to reach the target value, gradually replaces the characters constituting the character string with characters that match all target characters so that the recognition rate of the character string reaches the target value.

The present invention according to claim 2 is the document processing apparatus according to claim 1, wherein the sorting rule generating means generates the sorting rule based on an expected recognition rate for each character extracted by the character information extracting means.

The present invention according to claim 3 is the document processing apparatus according to claim 1 or 2, wherein the sorting rule generating means generates the sorting rule based on an expected recognition rate for each size of the characters extracted by the character information extracting means.

The present invention according to claim 4 is the document processing apparatus according to any one of claims 1 to 3, wherein the sorting rule generating means generates a new sorting rule when generated rules overlap.

The present invention according to claim 5 is the document processing apparatus according to any one of claims 1 to 4, further comprising test data generating means for generating test data to be sorted based on the rule generated by the sorting rule generating means.

The present invention according to claim 6 is a program that causes a computer to execute: a receiving step of receiving image information of a document; a character information extracting step of extracting character information including a character string from the received image information of the document; a sorting step of sorting the document received in the receiving step based on the extracted character information; and a sorting rule generating step of generating a sorting rule so as to adjust the redundancy of character string recognition, wherein a target value of the recognition rate of the character string to be recognized in the character information extracting step is set in advance, and the sorting rule generating step generates a sorting rule in which the redundancy of character string recognition is adjusted so that the recognition rate is equal to or higher than the target value, performs processing to replace at least one character constituting the character string extracted in the character information extracting step with a character that matches all target characters, and, when the recognition rate of the extracted character string is predicted not to reach the target value, gradually replaces the characters constituting the character string with characters that match all target characters so that the recognition rate of the character string reaches the target value.

According to the present invention of claim 1, it is possible to provide a document processing apparatus capable of generating rules with which documents can be sorted with higher accuracy than in a case without this configuration.

According to the present invention of claim 2, in addition to the effect obtained by the invention of claim 1, the recognition rate can be predicted more accurately than in a case without this configuration.

According to the present invention of claim 3, in addition to the effect obtained by the invention of claim 1 or 2, cases in which the character sizes differ can also be handled.

According to the present invention of claim 4, in addition to the effect obtained by the invention of any one of claims 1 to 3, cases in which generated sorting rules overlap can also be handled.

According to the present invention of claim 5, in addition to the effect obtained by the invention of any one of claims 1 to 4, it can be confirmed whether the generated rules yield the predicted result.

According to the present invention of claim 6, it is possible to provide a program capable of generating rules with which documents can be sorted with higher accuracy than in a case without this configuration.

FIG. 1 is a schematic diagram showing the configuration of a document processing system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware of the scanner device used in the embodiment of the present invention.
FIG. 3 is a schematic diagram showing a form classification system according to the embodiment of the present invention.
FIG. 4 is a block diagram showing the software configuration according to the embodiment of the present invention.
FIG. 5 is a plan view showing an example of a form used in the embodiment of the present invention.
FIG. 6 is a schematic diagram showing the method of calculating the probability in the second example of generating a sorting rule in the embodiment of the present invention.
FIG. 7 is a flowchart for carrying out the algorithm of the first example of generating a sorting rule in the embodiment of the present invention.
FIG. 8 is a flowchart for carrying out the algorithm of the second example of generating a sorting rule in the embodiment of the present invention.
FIG. 9 is a flowchart showing part of the processing flow in the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a schematic diagram showing an example of the configuration of a document processing system 2 according to an embodiment of the present invention. The document processing system 2 comprises terminal devices 4-1 to 4-n (n is an integer of 1 or more), each configured by a personal computer or the like; a server device 6 connected to the terminal devices 4-1 to 4-n via a network; an image forming apparatus 8 connected to the server device 6 via the network; and a scanner device 10 likewise connected to the server device 6 via the network.

The image forming apparatus 8 has a scanner unit 8a, which reads a paper document and converts it into an electronic document.

As shown in FIG. 2, the scanner device 10 includes a CPU 12, a memory 14, a storage device 16 such as a hard disk drive (HDD), a communication device 18 that transmits and receives data to and from external devices such as the server device 6 via the network, a scan drive unit 20, and an operation unit 22, all connected via a bus.

The CPU 12 executes processing based on a program stored in the memory 14 or the storage device 16 and controls the operation of the scanner device 10.

FIG. 3 shows a form classification system 24 as an example of a document processing apparatus. The form classification system 24 reads a form 26, which is a paper document, with the scanner device 10 described above or the scanner unit 8a of the image forming apparatus 8. The read document is recognized by OCR processing or the like and sorted as an electronic document 28. Here, predetermined forms such as 「支給受領書」 (payment receipt), 「支給清算書」 (payment settlement), and 「不可抗力協議」 (force majeure consultation) are classified and stored in the corresponding folders 30 prepared in advance, for example, in the server device 6 described above.

FIG. 4 shows the configuration of the document processing program in the scanner device 10. The reception unit 32 receives the image read by the scanner device 10, for example as bitmap data. The character information extraction unit 34 is implemented, for example, by OCR software, and extracts character information including a character string from the image received by the reception unit 32 by a predetermined method. For example, as shown in FIG. 5, every form handled here carries a title, such as 「支給受領書」 (payment receipt), at a predetermined position, and the character string at this position is extracted. The character string to be extracted need not be identified only by its position; it may also be identified by attributes of its characters, such as color or size.

The sorting unit 36 sorts (classifies) forms based on the character information extracted by the character information extraction unit 34. The rules by which the sorting unit 36 sorts forms are generated by the sorting rule generation unit 38. The forms sorted by the sorting unit 36 are transmitted to the server device 6 as described above and filed there, form by form.

The test data generation unit 40 generates test data for verifying how correctly the forms to be sorted are actually sorted. Character information is extracted from the test data by the character information extraction unit 34, the test data is sorted by the sorting unit 36 based on the sorting rules generated by the sorting rule generation unit 38, and the sorted data is sent to the terminal devices 4-1 to 4-n or the image forming apparatus 8 described above, where the sorting result is output.

Next, the algorithm by which the sorting rule generation unit 38 generates a sorting rule is described.

This embodiment assumes that the character recognition rate of the character information extraction unit 34 differs from character to character. First, an expected recognition rate is set for each character constituting the character string that the character information extraction unit 34 is to extract.

The expected recognition rate is obtained in advance, for example, by reading a test chart containing various characters and measuring the rate of correct answers when the read characters are subjected to character recognition. For example, 「支」 is 98%, 「給」 is 90%, 「受」 is 85%, 「領」 is 80%, and 「書」 is 95%.
Because the character recognition rate varies with character size, when characters of different sizes are used, an expected recognition rate is obtained for each size of each character.

Here, a sorting rule is expressed as follows.
Document title = "支給受領書"
This means that all characters must match. Therefore, for the character string 「支給受領書」 (payment receipt), the expected recognition rate is as follows.
支 * 給 * 受 * 領 * 書 = 98% * 90% * 85% * 80% * 95% = 57%
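The product above can be reproduced with a few lines of Python. This is only a minimal sketch: the names `EXPECTED_RATE` and `exact_match_rate` are illustrative assumptions, and the calculation assumes, as the description does, that the characters are recognized independently of one another.

```python
from math import prod

# Per-character expected recognition rates quoted in the description.
EXPECTED_RATE = {"支": 0.98, "給": 0.90, "受": 0.85, "領": 0.80, "書": 0.95}


def exact_match_rate(title: str, rates: dict[str, float]) -> float:
    """Expected rate at which every character of `title` is recognized correctly.

    Assumes the characters are recognized independently, so the rates multiply.
    """
    return prod(rates[ch] for ch in title)


rate = exact_match_rate("支給受領書", EXPECTED_RATE)
print(f"{rate:.1%}")  # 57.0%
```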

As it stands, the expected recognition rate is low and needs to be raised. The expected recognition rate can be raised by increasing the redundancy. A target recognition rate is set, and the redundancy is adjusted so that the target is met. For example, the target recognition rate for 「支給受領書」 is set to 90%.

A first example of increasing the redundancy is to progressively increase the number of wildcard characters. A wildcard character is a character that matches every target character, and is written here as "？". Characters are replaced with wildcard characters starting from those with the lowest expected recognition rate.
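To make the meaning of a wildcard rule concrete, the following sketch shows one way a recognized title could be compared against such a rule. The function name `matches` and the requirement that rule and title have the same length are assumptions made for illustration; the description does not specify how matching is implemented.

```python
WILDCARD = "？"  # full-width question mark, as written in the description


def matches(rule: str, recognized: str) -> bool:
    """Return True if `recognized` satisfies `rule`.

    Each WILDCARD in the rule matches any single character; every other
    position must match exactly. Lengths must be equal (an assumption).
    """
    if len(rule) != len(recognized):
        return False
    return all(r == WILDCARD or r == c for r, c in zip(rule, recognized))


print(matches("支給受？書", "支給受領書"))  # True
print(matches("支給受？書", "支給清算書"))  # False
```

With this rule, a misrecognition of the fourth character no longer prevents a match, which is why the wildcard position contributes 100% to the expected recognition rate.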

In 「支給受領書」 above, 「領」 has the lowest expected recognition rate, so 「領」 is replaced with a wildcard character and the following sorting rule is generated.
Document title = "支給受？書"
The expected recognition rate of the character string in this case is as follows.
支 * 給 * 受 * ？ * 書 = 98% * 90% * 85% * 100% * 95% = 71% (< 90%)

The character with the next lowest expected recognition rate is 「受」, so 「受」 is also replaced with a wildcard character and the following sorting rule is generated.
Document title = "支給？？書"
The expected recognition rate of the character string in this case is as follows.
支 * 給 * ？ * ？ * 書 = 98% * 90% * 100% * 100% * 95% = 83% (< 90%)

The character with the next lowest expected recognition rate is 「給」, so 「給」 is also replaced with a wildcard character and the following sorting rule is generated.
Document title = "支？？？書"
The expected recognition rate of the character string in this case is as follows.
支 * ？ * ？ * ？ * 書 = 98% * 100% * 100% * 100% * 95% = 93% (≥ 90%)
In this way, when the target is not met, the number of wildcard characters is increased repeatedly until the expected recognition rate reaches the 90% target.

However, in this first example the resulting rule may overlap with another sorting rule. For example, 「支給清算書」 (payment settlement) may also end up with document title = "支？？？書".
In addition, this method cannot be used when the expected recognition rate of every character is lower than the target recognition rate.

Next, a second example of increasing the redundancy is described.
In this second example, sorting rules are created in which a single character with a low expected recognition rate is replaced with a wildcard character, and such rules are accumulated by logical OR.
In the example above, the first sorting rule created is as follows.
Document title = "支給受？書"
Because the expected recognition rate of this rule does not reach 90%, the next sorting rule generated is
Document title = "支給受？書" + "支給？領書"
where "+" denotes the logical OR.

The expected recognition rate of the character string in this case is obtained as follows.
支 * 給 * 受 * ？ * 書 = 98% * 90% * 85% * 100% * 95% = 71% (< 90%) ... A
支 * 給 * ？ * 領 * 書 = 98% * 90% * 100% * 80% * 95% = 67% (< 90%) ... B
1 - (1 - A) * (1 - B) = 90% (≥ 90%)

That is, to obtain the probability of the logical OR of set A and set B shown in FIG. 6(a), the probability of not being in set A, (1 - A), and the probability of not being in set B, (1 - B), are first obtained as shown in FIG. 6(b). The probability of not being in set A is 29%, and the probability of not being in set B is 33%. Next, as shown in FIG. 6(c), the probability of being in neither set A nor set B, (1 - A) * (1 - B), is obtained; this is 10%. Finally, the probability of not being in "neither set A nor set B", that is, the probability of being in set A or set B, is 1 - (1 - A) * (1 - B), which is 90%.
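The calculation of FIG. 6 can be written as a short function. This sketch follows the formula given in the description, which combines the patterns as if they were independent events; the name `or_rate` and the generalization to more than two patterns are assumptions for illustration.

```python
from math import prod


def or_rate(pattern_rates: list[float]) -> float:
    """Expected rate for a logical OR of patterns, using 1 - prod(1 - p).

    This mirrors the formula in the description, which treats the
    patterns as independent events.
    """
    return 1 - prod(1 - p for p in pattern_rates)


A = 0.98 * 0.90 * 0.85 * 1.00 * 0.95  # "支給受？書"
B = 0.98 * 0.90 * 1.00 * 0.80 * 0.95  # "支給？領書"
print(f"{or_rate([A, B]):.1%}")  # 90.5%
```

Using the unrounded values of A and B gives about 90.5%; using the rounded 71% and 67% from the description gives about 90.4%. Either way the 90% target is met.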

As described above, if the expected recognition rate does not meet the target even when the character with the lowest expected recognition rate is replaced with a wildcard character, a rule is created in which the character with the next lowest expected recognition rate alone is replaced with a wildcard character, and the expected recognition rate of the logical OR of the two rules is obtained. If the target is still not met, the process is repeated with the third lowest character, and so on, until the target is met. If the expected recognition rate does not meet the target even after the logical OR has been taken over rules with one wildcard character each, the process is carried out with two wildcard characters per rule.

FIG. 7 is a flowchart for carrying out the algorithm of the first example described above.
First, in step S10, the expected recognition rate of the sorting rule is calculated. In the example above, 支 * 給 * 受 * 領 * 書 = 98% * 90% * 85% * 80% * 95% = 57%.

If the expected recognition rate does not meet the target, n = 1 is set in the next step S12. Here, n is the number of characters in the character string to be recognized that are replaced with wildcard characters. In the next step S14, a sorting rule is generated in which n characters, starting from the character with the lowest expected recognition rate, are replaced with wildcard characters. In the example above, document title = "支給受？書".

In the next step S16, the expected recognition rate of the sorting rule generated in step S14 is calculated. In the example above, 支 * 給 * 受 * ？ * 書 = 98% * 90% * 85% * 100% * 95% = 71%.

In the next step S18, it is determined whether the expected recognition rate calculated in step S16 is equal to or higher than the target recognition rate. If it is, the process ends. If the expected recognition rate is below the target recognition rate, the process proceeds to step S20, where n = n + 1 is set. In the example above, the expected recognition rate is below the target recognition rate of 90%, so n = 2 is set in step S20.

In the next step S22, it is determined whether n exceeds the character string length (5 in the example above). If n exceeds the character string length, no further sorting rule can be generated and the process ends. If, on the other hand, n is determined in step S22 to be equal to or less than the character string length, the process returns to step S14. In the example above, the target recognition rate is first reached at n = 3, when document title = "支？？？書".
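The loop of steps S10 to S22 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `first_example_rule`, the dictionary `RATES`, and the choice to return `None` when no rule can be generated are all hypothetical.

```python
from math import prod
from typing import Optional

WILDCARD = "？"
RATES = {"支": 0.98, "給": 0.90, "受": 0.85, "領": 0.80, "書": 0.95}


def first_example_rule(title: str, rates: dict[str, float],
                       target: float) -> Optional[str]:
    """Replace characters with wildcards, lowest expected rate first,
    until the expected recognition rate of the rule reaches `target`."""
    # S10: expected rate of the exact-match rule.
    if prod(rates[ch] for ch in title) >= target:
        return title
    # Character positions in ascending order of expected recognition rate.
    order = sorted(range(len(title)), key=lambda i: rates[title[i]])
    n = 1  # S12: number of characters replaced with wildcards
    while n <= len(title):  # S22: stop once n exceeds the string length
        replaced = set(order[:n])  # S14: the n least reliable characters
        rule = "".join(WILDCARD if i in replaced else ch
                       for i, ch in enumerate(title))
        # S16: wildcard positions contribute 100%, so only the rest multiply.
        rate = prod(rates[ch] for i, ch in enumerate(title)
                    if i not in replaced)
        if rate >= target:  # S18
            return rule
        n += 1  # S20
    return None


print(first_example_rule("支給受領書", RATES, 0.90))  # 支？？？書
```

For the example title and a 90% target, the loop passes through "支給受？書" and "支給？？書" before stopping at "支？？？書", matching the walk-through above.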

FIG. 8 is a flowchart for carrying out the algorithm of the second example described above.
First, in step S30, the expected recognition rate of the sorting rule is calculated. In the example above, 支 * 給 * 受 * 領 * 書 = 98% * 90% * 85% * 80% * 95% = 57%.

If the expected recognition rate does not meet the target, m = 1 and n = 1 are set in the next step S32. Here, m is the rank of a character in ascending order of expected recognition rate within the character string to be recognized, and n is the number of characters in the character string to be recognized that are replaced with wildcard characters.

In the next step S34, a sorting rule is generated in which n characters, starting from the character with the m-th lowest expected recognition rate, are replaced with wildcard characters. In the example above, document title = "支給受？書".

In the next step S36, a sorting rule is generated by taking the logical OR with the sorting rules already created. If no rule has been created yet, the sorting rule generated in step S34 is adopted as it is.

In the next step S38, the expected recognition rate of the sorting rule generated in step S36 is calculated. In the example above, 支 * 給 * 受 * ？ * 書 = 98% * 90% * 85% * 100% * 95% = 71%.

In the next step S40, it is determined whether the expected recognition rate calculated in step S38 is equal to or higher than the target recognition rate. If it is, the process ends. If the expected recognition rate is below the target recognition rate, the process proceeds to step S42, where m = m + 1 is set. In the example above, the expected recognition rate is below the target recognition rate of 90%, so m = 2 is set in step S42.

In the next step S44, it is determined whether m exceeds the character string length (5 in the example above). If m exceeds the character string length, the process proceeds to step S46, where m = 1 and n = n + 1 are set, increasing by one the number of characters replaced with wildcard characters. If, on the other hand, m is determined in step S44 to be equal to or less than the character string length, the process returns to step S34.

On returning to step S34, a sorting rule is generated in which one character, the one with the second lowest expected recognition rate, is replaced with a wildcard character. In the example above, document title = "支給？領書".

In the next step S36, a sorting rule is generated by taking the logical OR with the sorting rules already created. In the example above, document title = "支給受？書" + "支給？領書".

In the next step S38, the expected recognition rate of the sorting rule generated in step S36 is calculated. The expected recognition rate is 90%, which is equal to or higher than the target recognition rate, and the process ends.
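Steps S30 to S46 can be sketched in the same style. Several details here are assumptions rather than statements of the description: the function name `second_example_rule`, representing a rule as a list of patterns joined by logical OR, skipping a pattern that has already been added, and stopping when n exceeds the string length. The combined rate uses the 1 - (1 - A) * (1 - B) formula of FIG. 6.

```python
from math import prod
from typing import Optional

WILDCARD = "？"
RATES = {"支": 0.98, "給": 0.90, "受": 0.85, "領": 0.80, "書": 0.95}


def second_example_rule(title: str, rates: dict[str, float],
                        target: float) -> Optional[list[str]]:
    """Accumulate wildcard patterns by logical OR until the combined
    expected recognition rate reaches `target`."""
    # S30: expected rate of the exact-match rule.
    if prod(rates[ch] for ch in title) >= target:
        return [title]
    order = sorted(range(len(title)), key=lambda i: rates[title[i]])
    patterns: list[str] = []
    pattern_rates: list[float] = []
    m, n = 1, 1  # S32
    while n <= len(title):  # assumed stopping condition
        # S34: n characters starting from the m-th lowest expected rate.
        replaced = set(order[m - 1:m - 1 + n])
        pattern = "".join(WILDCARD if i in replaced else ch
                          for i, ch in enumerate(title))
        if pattern not in patterns:  # assumption: skip repeated patterns
            patterns.append(pattern)  # S36: OR with existing patterns
            pattern_rates.append(prod(rates[ch] for i, ch in enumerate(title)
                                      if i not in replaced))
        combined = 1 - prod(1 - p for p in pattern_rates)  # S38
        if combined >= target:  # S40
            return patterns
        m += 1  # S42
        if m > len(title):  # S44
            m, n = 1, n + 1  # S46
    return None


print(second_example_rule("支給受領書", RATES, 0.90))
# ['支給受？書', '支給？領書']
```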

In the first example described above, the rule may become, for example, document title = "支？？？書", so that the sorting rules for 「支給受領書」 (payment receipt) and 「支給清算書」 (payment settlement) overlap. Therefore, as shown in FIG. 9, it is determined in step S50 whether there is an overlapping sorting rule. If there is no overlapping sorting rule, the process ends; if there is one, a sorting rule is generated with a different algorithm. For example, the algorithm is switched from the first example to the second example to generate a new sorting rule.
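One possible reading of the check in step S50 is sketched below: a generated rule is treated as overlapping if it also matches the title of another form type, and in that case the next rule generator is tried. This interpretation of "overlap", the function names, and the two stand-in generators (which simply return the rules from the worked examples rather than computing them) are all assumptions for illustration.

```python
from typing import Callable, Optional

WILDCARD = "？"
Rule = list[str]  # a rule is a list of patterns combined by logical OR


def pattern_matches(pattern: str, text: str) -> bool:
    """WILDCARD matches any single character; lengths must be equal."""
    return len(pattern) == len(text) and all(
        p == WILDCARD or p == c for p, c in zip(pattern, text))


def rule_matches(rule: Rule, text: str) -> bool:
    return any(pattern_matches(p, text) for p in rule)


def choose_rule(title: str, other_titles: list[str],
                generators: list[Callable[[str], Rule]]) -> Optional[Rule]:
    """Try each generator in turn; keep the first rule that does not
    also match the title of another form type (assumed S50 check)."""
    for generate in generators:
        rule = generate(title)
        if not any(rule_matches(rule, other) for other in other_titles):
            return rule
    return None


# Stand-ins returning the rules from the worked examples in the description.
def first_example(title: str) -> Rule:
    return ["支？？？書"]


def second_example(title: str) -> Rule:
    return ["支給受？書", "支給？領書"]


rule = choose_rule("支給受領書", ["支給清算書", "不可抗力協議"],
                   [first_example, second_example])
print(rule)  # ['支給受？書', '支給？領書']
```

Here the first-example rule "支？？？書" also matches 「支給清算書」, so it is rejected and the second-example rule is used instead.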

The method of increasing the redundancy is not limited to the first and second examples described above. For example, characters similar to a character with a low expected recognition rate may be included as OR conditions.
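The similar-character variant mentioned above could be represented by giving each position a set of acceptable characters. The following is only a sketch: the function names and the table `SIMILAR` are hypothetical, and the pairing of 「領」 with 「頷」 is an illustrative guess, not data from the description.

```python
def build_rule(title: str, similar: dict[str, set[str]]) -> list[set[str]]:
    """For each position, accept the expected character or any character
    listed as similar to it (an OR condition per position)."""
    return [{ch} | similar.get(ch, set()) for ch in title]


def matches(rule: list[set[str]], recognized: str) -> bool:
    return len(rule) == len(recognized) and all(
        c in allowed for allowed, c in zip(rule, recognized))


# Hypothetical confusion table: characters an OCR engine might output instead.
SIMILAR = {"領": {"頷"}}

rule = build_rule("支給受領書", SIMILAR)
print(matches(rule, "支給受頷書"))  # True
print(matches(rule, "支給受算書"))  # False
```

Compared with a wildcard, this accepts only known misrecognitions at a position, so it is less likely to collide with the titles of other form types.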

2 Document processing system
8 Image forming apparatus
10 Scanner device
24 Form classification system
26 Paper document
28 Electronic document
32 Reception unit
34 Character information extraction unit
36 Sorting unit
38 Sorting rule generation unit
40 Test data generation unit

Claims

A document processing apparatus comprising:
receiving means for receiving image information of a document;
character information extracting means for extracting character information including a character string from the image information of the document received by the receiving means;
sorting means for sorting the document received by the receiving means based on the character information extracted by the character information extracting means; and
sorting rule generation means for generating a sorting rule for the sorting means so as to adjust the redundancy of character string recognition by the character information extracting means,
wherein the sorting rule generation means:
has a preset target value for the recognition rate of the character string to be recognized by the character information extracting means, and generates a sorting rule in which the redundancy of character string recognition is adjusted so that the recognition rate is equal to or higher than the target value;
performs a process of replacing at least one character constituting the character string extracted by the character information extracting means with a character that matches all target characters; and
when the recognition rate of the character string extracted by the character information extracting means is predicted not to reach the target value, replaces the characters constituting the character string step by step with characters that match all target characters so that the recognition rate of the character string reaches the target value.

The document processing apparatus according to claim 1, wherein the sorting rule generation means generates the sorting rule based on an expected recognition rate for each character extracted by the character information extracting means.

The document processing apparatus according to claim 1 or 2, wherein the sorting rule generation means generates the sorting rule based on an expected recognition rate for each size of the characters extracted by the character information extracting means.

The document processing apparatus according to any one of claims 1 to 3, wherein the sorting rule generation means generates a new sorting rule when the generated sorting rules overlap.

The document processing apparatus according to any one of claims 1 to 4, further comprising test data generating means for generating test data to be sorted based on the sorting rule generated by the sorting rule generation means.

A program causing a computer to execute:
a reception step of receiving image information of a document;
a character information extraction step of extracting character information including a character string from the image information of the received document;
a sorting step of sorting the document received in the reception step based on the extracted character information; and
a sorting rule generation step of generating a sorting rule so as to adjust the redundancy of character string recognition,
wherein the sorting rule generation step includes:
presetting a target value for the recognition rate of the character string to be recognized in the character information extraction step, and generating a sorting rule in which the redundancy of character string recognition is adjusted so that the recognition rate is equal to or higher than the target value;
a process of replacing at least one character constituting the character string extracted in the character information extraction step with a character that matches all target characters; and
when the recognition rate of the character string extracted in the character information extraction step is predicted not to reach the target value, replacing the characters constituting the character string step by step with characters that match all target characters so that the recognition rate of the character string reaches the target value.