JPH0528323A

JPH0528323A - Character recognition device

Info

Publication number: JPH0528323A
Application number: JP3177074A
Authority: JP
Inventors: Hiroshi Yoshida (吉田 浩史); Koichi Higuchi (樋口 浩一); Yoshiyuki Yamashita (山下 義征)
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1991-07-18
Filing date: 1991-07-18
Publication date: 1993-02-05

Abstract

PURPOSE: To perform effective post-processing in keeping with the meaning of a document by selecting a result character string from one or more candidate word character strings based on their correlation with previously selected word character strings. CONSTITUTION: A candidate word preparation part 130 combines the candidate character strings of each word area input from a word area cut-out part 120 to prepare word candidate character strings. From these, the strings that are valid as words are selected and output as candidate words to a correlation detection part 142 in a result word selection part 140. Whether a word candidate character string is valid as a word is judged by whether it exists as an entry in a word dictionary 141 in the result word selection part 140. When no valid word is obtained, the candidate word preparation part 130 outputs the word candidate character string formed by combining the first candidate characters to the correlation detection part 142. When only one word is output as a candidate word, a result word discrimination part 144 selects that candidate word as the result word.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a character recognition device capable of recognizing characters with high accuracy.

[0002]

[Prior Art] In recent years there has been a growing demand for character recognition devices that allow general documents to be input with high accuracy, in order to make more advanced use of information and to reduce the labor of information input. However, when recognizing a general document written in Japanese, it has been difficult to achieve high accuracy by character-by-character recognition alone, because the document contains characters of identical shape and characters of similar shape. Character recognition devices have therefore been proposed that, after recognizing each character, additionally perform post-processing using linguistic knowledge.

[0003] One character recognition device configured to perform such post-processing is disclosed in a document by the applicant of the present application (Proceedings of the 1988 Spring National Convention of the Institute of Electronics, Information and Communication Engineers, D-447). In that device, characters are recognized by the following processing.

[0004] First, recognition processing is performed on the characters to be recognized, and one or more candidate character strings are obtained. Next, word regions are detected by examining runs of the same character type in the candidate character strings. The candidate character names in each word region are then combined to create word candidate character strings. These word candidate character strings are checked against a word dictionary to determine whether each one is valid as a word, and those judged valid are treated as candidate words. If there is only one candidate word, it becomes the result word (that is, the word output as the recognition result). If there are two or more candidate words, the average of the similarity ranks of the candidate character names making up each candidate word (hereinafter the "average candidate rank") is compared, and the candidate word with the best average candidate rank becomes the result word.

[0005]

[Problems to Be Solved by the Invention] However, in the conventional character recognition device described above, the evaluation value used to select an appropriate result word from among several candidate words was calculated from the similarity ranks of the candidate character names making up each candidate word (for details, see column (D) of Table 1 below). The character-by-character recognition result therefore strongly influences the post-processing, so that the post-processing does not function adequately as post-processing. In particular, when the input character string consists of relatively simple characters, each character has several candidate characters, and combining those candidate characters yields many compound words so that many candidate words are extracted, the result word is likely to simply follow the character-by-character recognition result. This is explained with a concrete example.

[0006] For example, suppose the input character string is 「文字」 ("character"), as shown in column (A) of Table 1 below. Suppose that the three candidate character strings 「大字」, 「文学」, and 「丈宇」 shown in column (B) of Table 1 are obtained for this input, that the candidate character names in these strings are combined to create the nine word candidate character strings shown in column (C), and that checking these against the word dictionary yields the four candidate words 「大字」, 「大学」, 「文字」, and 「文学」 shown in column (D). The average candidate rank of each of the four candidate words is then calculated as shown in column (D). The word with the smallest value, 「大字」, is output as the result word. Although the correct word 「文字」 is among the candidate words, it is not selected, and the output ends up being the candidate word dictated by the character-by-character recognition result.
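The failure described in this example can be reproduced with a minimal Python sketch. It assumes that the three candidate character strings 「大字」, 「文学」, 「丈宇」 are the first, second, and third ranked candidates in that order, and it uses a four-word set as a stand-in for the word dictionary. The function names are hypothetical, and the values in Table 1 itself are not reproduced here.

```python
from itertools import product

# Ranked candidate characters per position (rank 1 first).
# Assumption: the candidate strings 大字 / 文学 / 丈宇 are ranks 1, 2, 3.
candidates = [["大", "文", "丈"], ["字", "学", "宇"]]

# Stand-in for the word dictionary: the four strings the text says are valid words.
dictionary = {"大字", "大学", "文字", "文学"}


def average_candidate_rank(ranks):
    """Mean of the 1-based similarity ranks of the characters in a word."""
    return sum(ranks) / len(ranks)


def score_candidate_words(candidates, dictionary):
    """Return (average rank, word) pairs for every combination found in the dictionary."""
    ranked_positions = [list(enumerate(chars, start=1)) for chars in candidates]
    scored = []
    for combo in product(*ranked_positions):
        word = "".join(ch for _, ch in combo)
        if word in dictionary:
            scored.append((average_candidate_rank([r for r, _ in combo]), word))
    return sorted(scored)


scored = score_candidate_words(candidates, dictionary)
result_word = scored[0][1]  # smallest average rank wins in the conventional scheme
```

Under these assumptions 「大字」 scores 1.0 while the correct 「文字」 scores 1.5, so the conventional rule picks 「大字」 purely because its characters ranked first individually.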

[0007] As a result, recognition accuracy falls, and cumbersome operations such as correction work become necessary to compensate, so that Japanese documents cannot be input efficiently.

[0008] To avoid this, the post-processing unit could be configured for higher-level language processing, for example by verifying the connections between words using syntactic or semantic analysis and selecting one result word from the candidate words on that basis. In that case, however, the language processing itself becomes complex, which raises new problems such as slower processing and a larger device.

[0009] The present invention has been made in view of these points. Its object is therefore to provide a character recognition device that can easily perform effective post-processing in keeping with the meaning of the document.

[0010]

[Means for Solving the Problems] To achieve this object, the character recognition device of the present invention comprises: a character recognition unit that obtains document image data from a document on a medium and recognizes the character strings in the document from that data, yielding candidate character strings in which each character consists of one or more candidate characters; a unit character string cutout unit that cuts out, from the candidate character strings, regions corresponding to unit character strings; a candidate unit character string creation unit that extracts one or more candidate unit character strings from the candidate character strings of each cut-out region; and a result unit character string selection unit that selects a result unit character string from the extracted candidate unit character strings. The device is characterized in that the result unit character string selection unit comprises: a language dictionary in which predetermined unit character strings are registered in association with one another; a unit character string storage unit that stores result unit character strings previously selected by the result unit character string selection unit; a relevance detection unit that uses the language dictionary to detect the degree of relevance between the one or more candidate unit character strings created by the candidate unit character string creation unit and the result unit character strings stored in the unit character string storage unit; and a result unit character string selection section that selects a result unit character string from the one or more candidate unit character strings based on the degree of relevance detected by the relevance detection unit.

[0011]

[Operation] With the configuration of the present invention, unit character strings can be registered in the language dictionary in association with one another. In addition, each unit character string selected as a result character string in the document being recognized is stored in the unit character string storage unit.

[0012] The degree of relevance between the one or more candidate unit character strings and the result unit character strings stored in the unit character string storage unit is then detected using the language dictionary, and the result unit character string is selected from the one or more candidate unit character strings based on this degree of relevance.

[0013] Here, the unit character strings registered in the unit character string storage unit are character strings contained in the same recognition target document from which the one or more candidate word character strings now under selection were extracted. The character recognition device of the present invention, which selects the result character string from the one or more candidate word character strings based on their degree of relevance to such word character strings, can therefore perform effective post-processing in keeping with the meaning of the document.

[0014]

[Embodiment] An embodiment of the character recognition device of the present invention is described below. In this embodiment the unit character string is a word (including a run of characters of the same character type). Accordingly, the unit character string cutout unit, candidate unit character string creation unit, result unit character string selection unit, and unit character string storage unit of the invention are referred to as the word region cutout unit, candidate word creation unit, result word selection unit, and word storage unit, respectively. The language dictionary is a word dictionary in which predetermined words are registered in association with one another.

[0015] FIG. 1 is a block diagram schematically showing the configuration of the character recognition device 100 of the embodiment.

[0016] The character recognition device 100 of this embodiment comprises a character recognition unit 110, a word region cutout unit 120 serving as the unit character string cutout unit, a candidate word creation unit 130 serving as the candidate unit character string creation unit, a result word selection unit 140 serving as the result unit character string selection unit, and an output terminal 150.

[0017] In this embodiment, as shown in FIG. 2, the character recognition unit 110 of the character recognition device 100 comprises a photoelectric conversion unit 111, a line cutout unit 112, a character cutout unit 113, a sub-pattern extraction unit 114, a feature extraction unit 115, and a matching unit 116.

[0018] In this embodiment, as shown in FIG. 1, the result word selection unit 140 of the character recognition device 100 comprises a word dictionary unit 141, a relevance detection unit 142, a word storage unit 143, and a result word determination unit 144.

[0019] Next, the details of each of the components described above are explained together with the operation of the character recognition device 100. To make the explanation easier to follow, an example document image to be recognized is used. FIG. 3 shows the form used as the recognition target document in this embodiment. In FIG. 3, 200 denotes the form and 201 a character entry frame. Table 2 below shows how an input character string in the recognition target document of FIG. 3 is processed by the series of steps described below.

[0020] An optical signal S (see FIG. 1) from a form 200 such as that shown in FIG. 3, on which characters, figures, symbols, and the like (hereinafter "characters") are written, is input to the photoelectric conversion unit 111 in the character recognition unit 110.

[0021] The photoelectric conversion unit 111 photoelectrically converts the optical signal S to generate an electric signal quantized into black-and-white binary values, for example with character strokes represented by black pixels and the background by white pixels (this signal corresponds to the document image data), and outputs the signal to the line cutout unit 112. The photoelectric conversion unit 111 can be implemented with, for example, a conventional image sensor.

[0022] The line cutout unit 112 scans the document image data input from the photoelectric conversion unit 111, with the character line direction as the main scanning direction and the character string direction as the sub-scanning direction, and creates a distribution of black pixels from the data. In this distribution, the span from the position where the black pixel count changes from 0 to 1 or more, up to the position just before it changes from 1 or more back to 0, is cut out as character line image data, which is output to the character cutout unit 113.

[0023] The character cutout unit 113 scans the line image data input from the line cutout unit 112, with the character line direction as the main scanning direction and the character string direction as the sub-scanning direction, and creates a distribution of black pixels from the line image data. In this distribution, the span from the position where the black pixel count changes from 0 to 1 or more, up to the position just before it changes from 1 or more back to 0, is cut out as the pattern data of one character, which is output to the sub-pattern extraction unit 114.
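The line and character cutting steps can be sketched as projection-profile segmentation. The sketch below is only an illustration in Python: it assumes horizontally written text in a binary image stored as a list of rows (1 = black pixel), counts black pixels per row to cut lines and per column to cut characters, and uses hypothetical function names.

```python
def cut_runs(profile):
    """Return (start, end) pairs, end inclusive, of maximal spans where the count is >= 1."""
    runs, start = [], None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i                      # count changed from 0 to 1 or more
        elif count == 0 and start is not None:
            runs.append((start, i - 1))    # position just before it returns to 0
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


def row_profile(image):
    """Black pixel count of each row."""
    return [sum(row) for row in image]


def column_profile(image):
    """Black pixel count of each column."""
    return [sum(col) for col in zip(*image)]


def cut_lines(image):
    """Split a page image into line images (assumes horizontal text lines)."""
    return [image[top:bottom + 1] for top, bottom in cut_runs(row_profile(image))]


def cut_characters(line_image):
    """Split one line image into character pattern images."""
    return [[row[left:right + 1] for row in line_image]
            for left, right in cut_runs(column_profile(line_image))]
```

This simple rule cannot separate touching characters or merge a character whose parts are split by a blank column; the text does not say how the device handles those cases.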

[0024] In this embodiment, the sub-pattern extraction unit 114 scans the character pattern input from the character cutout unit 113 in several directions and, on each scan line, detects runs in which at least a predetermined number h of black pixels are consecutive (h = 5 in this embodiment). By extracting these runs as the black pixel components of a sub-pattern, it extracts one sub-pattern per scanning direction from the character pattern. The sub-patterns are output in sequence to the feature extraction unit 115.

[0025] In this embodiment, the scanning directions used by the sub-pattern extraction unit 114 are the following four: the direction perpendicular to the character line direction (hereinafter the "X direction"; see FIG. 3), called the "vertical direction"; the direction parallel to the X direction, called the "horizontal direction"; the direction 45° counterclockwise from the X axis, called the "left diagonal direction"; and the direction 45° clockwise from the X axis, called the "right diagonal direction". The sub-pattern extraction unit 114 therefore extracts four sub-patterns in this embodiment: vertical, horizontal, right diagonal, and left diagonal. The vertical sub-pattern is extracted by scanning the character pattern data with the vertical direction as the main scanning direction, detecting runs of consecutive black pixels on each scan line (hereinafter "black runs"), and extracting the black runs whose number of consecutive black pixels L satisfies L ≥ h. The horizontal, left diagonal, and right diagonal sub-patterns are extracted in the same way as the vertical sub-pattern, except that the main scanning direction differs.
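The black-run rule (keep only runs with L ≥ h along each scanning direction) can be sketched as follows. This is a hedged Python illustration, not the device's implementation: the image is a list of rows with 1 for black, the mapping of the four named directions to (dy, dx) steps in image coordinates is an assumption, and the function names are hypothetical.

```python
# Scan directions as (dy, dx) steps in image coordinates (y grows downward).
# Mapping the four named directions to these steps is an assumption.
DIRECTIONS = {
    "vertical": (1, 0),
    "horizontal": (0, 1),
    "left_diagonal": (-1, 1),   # 45 degrees counterclockwise from the X axis
    "right_diagonal": (1, 1),   # 45 degrees clockwise from the X axis
}


def extract_subpattern(image, direction, h=5):
    """Keep only black pixels that lie on black runs of length >= h along direction."""
    dy, dx = direction
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]

    def inside(y, x):
        return 0 <= y < rows and 0 <= x < cols

    def flush(run):
        # A run qualifies only if its length L satisfies L >= h.
        if len(run) >= h:
            for y, x in run:
                out[y][x] = 1

    for y in range(rows):
        for x in range(cols):
            if inside(y - dy, x - dx):
                continue  # not the first pixel of a scan line
            run, cy, cx = [], y, x
            while inside(cy, cx):
                if image[cy][cx]:
                    run.append((cy, cx))
                else:
                    flush(run)
                    run = []
                cy, cx = cy + dy, cx + dx
            flush(run)
    return out


def extract_all_subpatterns(image, h=5):
    """Return one sub-pattern image per scanning direction."""
    return {name: extract_subpattern(image, d, h) for name, d in DIRECTIONS.items()}
```

With h = 5, a stroke survives only in the sub-pattern whose scanning direction follows the stroke, which is what lets the four sub-patterns separate a character into directional stroke components.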

[0026] In this embodiment, the feature extraction unit 115 sets, on each directional sub-pattern input from the sub-pattern extraction unit 114, a rectangular region corresponding to the circumscribing frame of the character pattern, and divides this rectangular region into N×M small regions (N and M are any suitable natural numbers). The length of the character stroke portion of each sub-pattern contained in each small region is then extracted as the feature value of that small region for that sub-pattern. These feature values are normalized by the size of the character's circumscribing frame. A feature matrix made up of the normalized feature values f_i is created for each directional sub-pattern, and these matrices are output to the matching unit 116.

[0027] In the feature extraction unit 115 of this embodiment, the division numbers are N = M = 8, and the feature values are normalized by the value (dX + dY)/2, where dX is the horizontal length of the character's circumscribing frame and dY is its vertical length. The feature value f_i is the feature value of the small region numbered i when the N×M small regions are numbered consecutively from 1 to N×M (i = 1, 2, ..., N×M). Each f_i is an element value of the feature matrix.
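A small Python sketch can make the grid division and the (dX + dY)/2 normalization concrete. It rests on two assumptions that the text does not state: the stroke length inside a small region is approximated by its black pixel count, and N indexes grid rows while M indexes grid columns. The function names are hypothetical.

```python
def bounding_box(image):
    """Circumscribing frame (top, bottom, left, right) of the black pixels of a character."""
    ys = [y for y, row in enumerate(image) if any(row)]
    xs = [x for x in range(len(image[0])) if any(row[x] for row in image)]
    return min(ys), max(ys), min(xs), max(xs)


def feature_matrix(subpattern, box, n=8, m=8):
    """N x M matrix of per-region stroke amounts, normalized by (dX + dY) / 2.

    Assumption: stroke length in a region is approximated by its black pixel count.
    """
    top, bottom, left, right = box
    d_y = bottom - top + 1      # vertical length of the circumscribing frame
    d_x = right - left + 1      # horizontal length of the circumscribing frame
    norm = (d_x + d_y) / 2
    features = [[0.0] * m for _ in range(n)]
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            if subpattern[y][x]:
                i = (y - top) * n // d_y    # grid row of this pixel
                j = (x - left) * m // d_x   # grid column of this pixel
                features[i][j] += 1
    return [[value / norm for value in row] for row in features]
```

The box is taken from the full character pattern and reused for every directional sub-pattern, so the four feature matrices share one grid. Normalizing by the frame size keeps the values comparable between large and small renderings of the same character.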

[0028] The matching unit 116 compares the feature matrix F input from the feature extraction unit 115 with the feature matrices G of standard patterns prepared in advance (not shown) and obtains the similarity R given by equation (1) below. The character names of the dictionary matrices whose similarity R is at least a predetermined value P are taken as candidate character names and are ranked in descending order of similarity as the first candidate character, the second candidate character, and so on. These one or more candidate characters are output to the word region cutout unit 120 as recognition result candidate characters.

[Equation 1]

[0029]

[0030] In equation (1), g_i denotes an element of the dictionary matrix.
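Equation (1) itself is not reproduced in this text, so the exact form of R is unknown here. The following Python sketch therefore swaps in a normalized correlation (cosine similarity) between the flattened feature values f_i and g_i purely as an assumed stand-in, in order to illustrate the thresholding by P and the ranking of candidate characters. The function names and the sample vectors are hypothetical.

```python
import math


def similarity(f, g):
    """Assumed stand-in for R: normalized correlation of two feature vectors."""
    dot = sum(a * b for a, b in zip(f, g))
    norm_f = math.sqrt(sum(a * a for a in f))
    norm_g = math.sqrt(sum(b * b for b in g))
    return dot / (norm_f * norm_g) if norm_f and norm_g else 0.0


def rank_candidates(f, standard_patterns, p):
    """Return character names with R >= p, highest similarity first."""
    scored = [(similarity(f, g), name) for name, g in standard_patterns.items()]
    kept = [(r, name) for r, name in scored if r >= p]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in kept]
```

For example, `rank_candidates([1, 0, 1], {"A": [1, 0, 1], "B": [1, 1, 1], "C": [0, 1, 0]}, 0.5)` keeps "A" and "B" in that order and drops "C", mirroring how the first and second candidate characters are assigned.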

[0031] The word region cutout unit 120 stores the candidate characters output from the matching unit 116 of the character recognition unit 110 one after another in a storage unit (not shown) and builds candidate character strings. For the recognition target document shown in FIG. 3, the word region cutout unit 120 builds, for example, the three candidate character strings shown in column (B) of Table 2 below, based on the recognition result candidate characters extracted by the character recognition unit 110 for the input character string in the document (see column (A) of Table 2 below).

[0032] Next, the word region cutout unit 120 detects, one after another, the regions corresponding to words (called "word regions") in the candidate character strings built here (see column (B) of Table 2), and outputs the candidate character string portion of each region in sequence to the candidate word creation unit 130.

[0033] In this embodiment, word regions are detected based on the character type of the first-ranked candidate characters. That is, starting from the first character of the region under examination, the span over which the first-ranked candidate characters continue to have the same character type as the first-ranked candidate of that first character is cut out as one word region. In column (B) of Table 2, for example, the leading character 「パ」 is katakana, so the first word region extends through 「パターン」, the span whose first-ranked candidates are katakana. With this method, a run of hiragana, such as a sequence of particles, is detected as a single word region, so regions that differ from words in the usual sense may be cut out, and such regions may not be handled by the later post-processing. In general, however, the great majority of ordinary words and of technical terms in specialized fields consist of runs of kanji, runs of katakana, runs of Latin letters, and the like, so the word region cutout method used in this embodiment does not impair the effect of the invention.
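The character-type rule can be sketched in Python using Unicode block ranges to classify the first-ranked candidates. The block boundaries below are the standard Hiragana, Katakana, and CJK Unified Ideographs blocks; using them as the device's notion of character type is an assumption, as are the function names and the sample string.

```python
def char_type(ch):
    """Classify a character by Unicode block (an assumed notion of character type)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"   # this block also contains the long-vowel mark ー
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "latin"
    if ch.isdigit():
        return "digit"
    return "other"


def cut_word_regions(first_candidates):
    """Return (start, end) index pairs, end exclusive, of same-type runs.

    first_candidates is the string of first-ranked candidate characters.
    """
    regions, start = [], 0
    for i in range(1, len(first_candidates) + 1):
        at_end = i == len(first_candidates)
        if at_end or char_type(first_candidates[i]) != char_type(first_candidates[start]):
            regions.append((start, i))
            start = i
    return regions
```

For a hypothetical first-candidate string such as 「パターン認識の」, this yields the regions 「パターン」, 「認識」, and 「の」. The resulting index ranges would then be applied to every candidate rank, so each word region carries all candidate characters for its positions.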

[0034] From the candidate character strings listed in column (B) of Table 2, the word region cutout unit 120 of this embodiment cuts out word regions as shown in column (C) of Table 2 and outputs them in sequence to the candidate word creation unit 130.

[0035] The candidate word creation unit 130 combines the candidate characters of each word region input from the word region cutout unit 120 to create word candidate character strings, selects from these the strings that are valid as words, and outputs the selected word candidate character strings as candidate words to the relevance detection unit 142 in the result word selection unit 140.

[0036] In this embodiment, whether a word candidate character string is valid as a word is judged by whether it exists as an entry in the word dictionary unit 141 (described in detail later) in the result word selection unit 140. If no valid word at all is obtained, the candidate word creation unit 130 outputs to the relevance detection unit 142 the word candidate character string formed by combining the first-ranked candidate characters.

[0037] Consider, for example, the case where the input character string is 「文字」, a part of the recognition target document shown in FIG. 3 (see column (A) of Table 3 below). The word region cutout unit 120 and the candidate word creation unit 130 then create candidate words as follows.

[0038] First, from the input character string 「文字」, the word region cutout unit 120 in this embodiment creates candidate character strings consisting of 「大」, 「文」, 「丈」 for the first character and 「字」, 「学」, 「宇」 for the second (see column (B) of Table 3 below). The candidate word creation unit 130 combines the candidate characters 「大」, 「文」, 「丈」, 「字」, 「学」, and 「宇」 in these strings to create the nine word candidate character strings shown in column (C) of Table 3 below. It then checks whether each of these word candidate character strings exists as an entry in the word dictionary unit 141 and, if it does, treats it as a candidate word. From the nine word candidate character strings in column (C) of Table 3, the four candidate words 「大字」, 「大学」, 「文字」, and 「文学」 are selected, as shown in column (D) of Table 3. These candidate words are output to the relevance detection unit 142 in the result word selection unit 140.
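The candidate word creation step, including the fallback to the first-ranked candidates, can be sketched in Python as follows. The dictionary here is a plain set standing in for the word dictionary unit 141, and the function name is hypothetical.

```python
from itertools import product


def create_candidate_words(region_candidates, dictionary):
    """Combine per-position candidates and keep the combinations found in the dictionary.

    region_candidates: one list of candidate characters per position, rank 1 first.
    Falls back to the string of first-ranked candidates when nothing matches.
    """
    combos = ["".join(chars) for chars in product(*region_candidates)]
    words = [w for w in combos if w in dictionary]
    if words:
        return words
    return ["".join(position[0] for position in region_candidates)]


region = [["大", "文", "丈"], ["字", "学", "宇"]]
stand_in_dictionary = {"大字", "大学", "文字", "文学"}   # stand-in for unit 141
candidate_words = create_candidate_words(region, stand_in_dictionary)
```

The number of combinations grows as the product of the candidate counts per position, so for long word regions a real implementation would probably prune the search instead of enumerating every combination; the text does not describe such pruning.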

[0039] In this embodiment, the word dictionary unit 141 serves both as the word dictionary used in the relevance test described later and as the word dictionary used to select the candidate words described above. To make the relevance test easier, the word dictionary 141 is a thesaurus-structured word dictionary in which the items are systematically classified and associated with one another. FIGS. 4 and 5 are diagrams for explaining the word dictionary unit 141 of the embodiment. FIG. 4 conceptually shows the structure of the word dictionary unit 141 of the embodiment. In FIG. 4, an arrow indicates that the word at its start point and the word at its end point are related to each other, and an equal sign indicates that the two words are synonyms. FIG. 5 shows an example of the dictionary entries for "recognition" and "character", which appear in the document to be recognized. In FIG. 5, BT denotes a broader term (superordinate word) of the word in question (a word such as "recognition" or "character"), and NT denotes a narrower term (subordinate word).
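
As a rough illustration of the BT/NT entry format described above, a thesaurus entry could be modeled as below. The actual contents of FIG. 5 are not reproduced in this text, so the class `Entry`, the helpers `link` and `neighbors`, and in particular the direction of each broader/narrower relation are assumptions made only for the sketch. Synonym links (the equal sign in FIG. 4) are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One dictionary item with its broader (BT) and narrower (NT) terms."""
    term: str
    bt: set = field(default_factory=set)
    nt: set = field(default_factory=set)


def link(thesaurus, broader, narrower):
    """Record that `broader` is a BT of `narrower`, and the reverse NT link."""
    thesaurus.setdefault(broader, Entry(broader)).nt.add(narrower)
    thesaurus.setdefault(narrower, Entry(narrower)).bt.add(broader)


def neighbors(thesaurus, term):
    """Terms one branch away from `term`, whichever way the link runs."""
    entry = thesaurus[term]
    return entry.bt | entry.nt


thesaurus = {}
link(thesaurus, "character", "pattern")      # direction is invented
link(thesaurus, "recognition", "character")  # direction is invented
print(sorted(neighbors(thesaurus, "character")))  # ['pattern', 'recognition']
```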

[0040] To make searching easier, the word dictionary unit 141 is preferably provided with a map that allows each item to be looked up by, for example, its first character and its number of characters. In this embodiment, the word dictionary unit 141 is configured so that it can be used both for the relevance detection described later and for the word selection described above, which checks whether a string is valid as a word. The invention is not limited to this, however, and separate dictionary units may be provided for relevance detection and for word selection. In that case, the dictionary used to test whether a word candidate character string is valid as a word can have a simple structure in which only the items can be searched. This word dictionary unit may of course also be provided inside the candidate word creation unit 130.
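
A lookup map keyed by first character and number of characters, as suggested above, might look like the following sketch. The names `build_index` and `is_item` and the sample items are hypothetical; the text does not specify how the map is actually organized.

```python
from collections import defaultdict


def build_index(items):
    """Group dictionary items by (first character, number of characters)."""
    index = defaultdict(set)
    for word in items:
        index[(word[0], len(word))].add(word)
    return index


def is_item(index, candidate):
    """Check whether a word candidate string is a dictionary item."""
    if not candidate:
        return False
    return candidate in index.get((candidate[0], len(candidate)), ())


index = build_index(["大字", "大学", "文字", "文学", "認識"])
print(is_item(index, "文字"))  # True
print(is_item(index, "文宇"))  # False
```

Only the bucket sharing the candidate's first character and length is examined, so most invalid combinations are rejected without scanning the whole dictionary.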

[0041] The degree-of-association detection unit 142 of the result word selection unit 140 operates as follows, depending on the number of candidate words input from the candidate word creation unit 130 and the number of words stored in the word storage unit 143, described later. The word storage unit 143 stores the words that the result word determination unit 144 has finally selected as result words, and can easily be implemented with any of various storage media such as memory, a flexible disk, or a hard disk.

[0042] First, when only one candidate word is input from the candidate word extraction unit, that candidate word is output to the result word determination unit 144 as it is.

[0043] On the other hand, when there are two or more candidate words and no word has yet been stored in the word storage unit 143, the same degree of association (for example, "1") is attached to all candidate words input from the candidate word creation unit 130, and they are output to the result word determination unit 144.

[0044] When there are two or more candidate words and some words have already been stored in the word storage unit 143, each candidate word is compared with each of the words stored in the word storage unit 143, that is, the words already selected as result words, and the degree of association is detected for each pair. For each candidate word, the average of its degrees of association with all the stored words is calculated, and each candidate word is output to the result word determination unit 144 together with its calculated average degree of association.
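
The three cases handled by the degree-of-association detection unit 142 can be summarized in a short sketch. The function name `detect_relevance`, the use of `None` for the single-candidate case, and the toy distance table are assumptions for illustration; the distance between "university" and "pattern" in particular is not stated in the text and is assumed here.

```python
def detect_relevance(candidates, stored_words, distance):
    """Return (candidate, average distance) pairs following the three cases."""
    # Case 1: a single candidate is passed through unchanged.
    if len(candidates) == 1:
        return [(candidates[0], None)]
    # Case 2: nothing stored yet, so every candidate gets the same value.
    if not stored_words:
        return [(word, 1.0) for word in candidates]
    # Case 3: average the distance to every previously selected word.
    return [
        (word, sum(distance(word, s) for s in stored_words) / len(stored_words))
        for word in candidates
    ]


# Toy distances; ("university", "pattern") is an assumed value.
TOY = {
    ("character", "pattern"): 1, ("character", "recognition"): 1,
    ("university", "pattern"): 5, ("university", "recognition"): 5,
}
scores = detect_relevance(
    ["university", "character"], ["pattern", "recognition"],
    lambda a, b: TOY[(a, b)],
)
print(scores)  # [('university', 5.0), ('character', 1.0)]
```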

[0045] The degree of association between two words is expressed as the distance between the two items in the thesaurus-structured word dictionary 141 described above, that is, the minimum number of branches connecting the two items. The smaller the distance (the fewer the branches), the stronger the association. For example, in FIG. 4 the items "character" and "pattern" are directly connected by a single branch (arrow), so their degree of association is expressed as 1. The items "university" and "recognition" are connected through five arrows, "university" - "art" - "literature" - "book" - "character" - "recognition", so their degree of association is expressed as 5.
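
Counting the minimum number of branches between two items is a shortest-path problem, and a breadth-first search is one conventional way to solve it. The sketch below uses only the branches named in the text; the real dictionary in FIG. 4 may contain more items and branches, and the names `EDGES`, `build_graph` and `distance` are hypothetical.

```python
from collections import deque

# Only the branches mentioned in the text; FIG. 4 may contain others.
EDGES = [
    ("character", "pattern"),
    ("university", "art"), ("art", "literature"), ("literature", "book"),
    ("book", "character"), ("character", "recognition"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of branches."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distance(graph, start, goal):
    """Minimum number of branches between two items, or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


graph = build_graph(EDGES)
print(distance(graph, "character", "pattern"))       # 1
print(distance(graph, "university", "recognition"))  # 5
```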

[0046] For example, suppose the input character string "character" illustrated in Table 3 is to be recognized, the four candidate words "large character", "university", "character" and "literature" have been extracted as shown in column (D) of Table 3, and the two words "pattern" and "recognition" have been stored in the word storage unit 143 as result words as shown in column (D) of Table 2. In that case, detecting the degree of association between each candidate word and each result word using a word dictionary such as that shown in FIG. 4 gives the results shown in column (E) of Table 3.

[0047] The method of detecting the degree of association, that is, of detecting the distance between items, can easily be carried out by a conventional method using a dictionary in the description format shown in FIG. 5, so its explanation is omitted.

[0048] Based on the candidate words input from the degree-of-association detection unit 142 and the degrees of association attached to them, the result word determination unit 144 selects the candidate word most likely to be correct as the final recognition result word and outputs it to the output terminal 150. At the same time, it checks whether that word is already stored in the word storage unit 143, and if not, newly stores it there. In this embodiment, the most likely candidate word is selected as follows.

[0049] First, when only one candidate word is output from the degree-of-association detection unit 142, the result word determination unit 144 selects this candidate word as the result word.

[0050] On the other hand, when two or more candidate words are input from the degree-of-association detection unit 142, the result word determination unit 144 selects as the result word the candidate with the strongest association, that is, the candidate whose average distance in the thesaurus-structured word dictionary to the words stored in the word storage unit 143 is smallest.

[0051] When several candidate words share the strongest degree of association, the result word determination unit 144 selects as the result word the one that was input to it first (the candidate word with the highest average candidate rank).
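
The selection rule and the storage update performed by the result word determination unit 144 could be sketched as follows. The function name `select_result_word`, the list used as the word storage, and the sample scores are hypothetical. Python's `min` returns the first minimal element, which matches the tie-breaking rule of preferring the candidate that was input first.

```python
def select_result_word(scored_candidates, stored_words):
    """Pick the result word and store it if it is not stored already.

    `scored_candidates` is a list of (word, average distance) pairs in input
    order; a smaller average distance means a stronger association.
    """
    if len(scored_candidates) == 1:
        result = scored_candidates[0][0]
    else:
        # min() keeps the earliest candidate when several share the minimum.
        result = min(scored_candidates, key=lambda pair: pair[1])[0]
    if result not in stored_words:
        stored_words.append(result)
    return result


stored = ["pattern", "recognition"]
# The scores below are made-up values for illustration only.
picked = select_result_word(
    [("university", 5.0), ("character", 1.0), ("literature", 3.0)], stored
)
print(picked)  # character
print(stored)  # ['pattern', 'recognition', 'character']
```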

[0052] In the example of column (E) of Table 3, among the four candidate words for the input character string "character", the candidate "character" has the smallest distance to the already recognized words "pattern" and "recognition", and is therefore judged to have the strongest association. The word "character" is thus selected as the result word, output to the output terminal 150, and at the same time stored in the word storage unit 143.

[0053] The output terminal 150 is a data output terminal for outputting the recognition result to the outside. Through this output terminal 150, the character recognition device of the present invention can be connected to other systems (for example, a device that performs translation using the character recognition result), a medium that stores the character recognition result, a communication network that transmits the character recognition result to another system at another location, other information processing systems, and the like.

[0054] An embodiment of the character recognition device of the present invention has been described above, but the invention is not limited to that embodiment. The operation of each component, the processing method used in each component, the flow of input and output signals between components, and the number and arrangement of the components can be changed as appropriate.

[0055] For example, in the embodiment described above, the character recognition unit 110 is configured as shown in FIG. 2. This is because the character recognition method used is, as described above, one that extracts a sub-pattern for each direction, extracts from the sub-patterns a feature quantity indicating the length of the character lines, and collates that feature quantity with a dictionary. However, the character recognition method is not limited to this, and other conventionally known methods may be used. The configuration of the character recognition unit can accordingly be changed to suit the method used.

[0056] The word dictionary is likewise not limited to the examples shown in FIGS. 4 and 5. Any suitable word dictionary may be used, such as one that supports other search methods, one whose items are classified in another way, one to which additional data has been added, or one in which the words are related to one another in another way.

[0057] In the embodiment described above, the unit character string was taken to be a word (including a run of connected characters of the same character type). However, the invention can clearly also be applied when the unit character string is some other character string, such as a word, a compound word, or a clause.

【００５８】[0058]

[Effects of the Invention] As is apparent from the above description, the character recognition device of the present invention can select a result character string from one or more candidate unit character strings based on the degree of association with other unit character strings contained in the document to be recognized, from which the candidates were extracted. Effective post-processing, namely selecting from one or more candidate unit character strings the one that best fits the meaning of the document to be recognized, can therefore be performed independently of character-level recognition. Moreover, this post-processing is simpler than methods that test the connections between words by syntactic or semantic analysis.

[0059] Because highly accurate recognition can thus be achieved with a simple operation, a high-performance character recognition device can be provided that allows even large volumes of documents to be input quickly and efficiently.

【表１】[Table 1]

【００６０】 [0060]

【００６１】[0061]

【表２】[Table 2]

【００６２】 [0062]

【００６３】[0063]

【表３】[Table 3]

【００６４】 [0064]

[Brief description of drawings]

FIG. 1 is a configuration diagram illustrating an embodiment of the character recognition device of the present invention.

FIG. 2 is a diagram showing the configuration of the character recognition unit included in the character recognition device of the embodiment.

FIG. 3 is an explanatory diagram of a form given as an example of a document to be recognized.

FIG. 4 is a diagram conceptually showing the structure of the word dictionary unit of the embodiment.

FIG. 5 is a diagram showing a description example of the word dictionary of the embodiment.

[Explanation of symbols]

S: optical signal from the document
100: character recognition device of the embodiment
110: character recognition unit
120: word area cutout unit (unit character string cutout unit)
130: candidate word creation unit (candidate unit character string creation unit)
140: result word selection unit (result unit character string selection unit)
141: word dictionary (language dictionary)
142: degree-of-association detection unit
143: word storage unit (unit character string storage unit)
144: result word determination unit (result unit character string selection unit)
150: output terminal
200: form
201: character entry frame

Claims

[Claims]

1. A character recognition device comprising: a character recognition unit that obtains document image data from a document on a medium, recognizes a character string in the document from the document image data, and obtains a candidate character string in which each character is represented by one or more candidate characters; a unit character string cutout unit that cuts out, from the candidate character string, regions each corresponding to a unit character string; a candidate unit character string creation unit that extracts one or more candidate unit character strings from the cut-out candidate character string of each region; and a result unit character string selection unit that selects a resulting unit character string from the extracted candidate unit character strings, wherein the result unit character string selection unit comprises: a language dictionary in which predetermined unit character strings are registered in association with one another; a unit character string storage unit that stores unit character strings previously selected as results by the result unit character string selection unit; a degree-of-association detection unit that uses the language dictionary to detect the degree of association between the one or more candidate unit character strings created by the candidate unit character string creation unit and the unit character strings stored in the unit character string storage unit; and a result unit character string selection unit that selects a resulting unit character string from the one or more candidate unit character strings based on the degree of association detected by the degree-of-association detection unit.

2. The character recognition device according to claim 1, wherein the language dictionary is a thesaurus-structured language dictionary in which predetermined unit character strings are hierarchically classified and arranged in a tree structure or a network structure.

3. The character recognition device according to claim 1, wherein the degree-of-association detection unit takes the distance between a unit character string and each unit character string of its superordinate and subordinate concepts, registered in advance in association with one another in the language dictionary, to be 1, and detects as the degree of association the length of the shortest path connecting the candidate unit character string and the unit character string registered in the language dictionary.

4. The character recognition device according to claim 1, wherein the unit character string selection unit selects, as the resulting unit character string, the candidate unit character string among the one or more candidate unit character strings that has the highest degree of association with the result unit character string stored in the unit character string storage unit, or, when a plurality of result unit character strings are stored, the candidate unit character string with the highest average degree of association.

5. The character recognition device according to claim 1, wherein the unit character string is a character string formed by combining one or more of a run of connected characters of the same character type, a word, a compound word, and a clause.