JPH11134439A - Method for recognizing word

Info

Publication number: JPH11134439A
Application number: JP9298445A
Authority: JP
Inventors: Takayoshi Yoshida; 隆義吉田; Koichi Higuchi; 浩一樋口
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1997-10-30
Filing date: 1997-10-30
Publication date: 1999-05-21

Abstract

PROBLEM TO BE SOLVED: To provide a word recognition method that shortens recognition time. SOLUTION: A word pattern is divided by two horizontal dividing lines into three regions: an upper pattern, a middle pattern, and a lower pattern. From the horizontal peripheral distribution and the local vertical peripheral distribution of the upper pattern, every character upper pattern is detected and classified by shape into one of several types, and a word upper pattern code is obtained in which the number of detections and the codes denoting their types are arranged in order of detection position. The character lower patterns are likewise detected and classified by shape, and a word lower pattern code is obtained in the same way. One or more word candidates are then selected by looking up a word recognition dictionary with the upper and lower pattern codes derived from the input word pattern.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a character recognition method for reading character patterns of printed European-language text.

[0002]

[Prior Art] A conventional character recognition method of this kind is disclosed in Japanese Unexamined Patent Publication No. Sho 57-23185, "Character Recognition Method"; FIG. 7 is a block diagram of a character recognition apparatus that uses it. The character pattern of a single character is input as a binary image from an input terminal 1 to a character pattern memory 2. A character frame dividing unit 3 detects the circumscribing frame of the input character pattern and determines the division points for dividing its interior into an M × N matrix.

[0003] A horizontal sub-pattern extraction unit 4 extracts the horizontal stroke components from the input character pattern to create a horizontal sub-pattern. FIG. 8(a) shows the horizontal sub-pattern (HSP). Similarly, vertical, left-diagonal, and right-diagonal sub-pattern extraction units 5, 6, and 7 extract the stroke components in their respective directions from the input character pattern and create sub-patterns such as those shown in FIGS. 8(b), 8(c), and 8(d).

[0004] A feature matrix extraction unit 8 divides these four sub-patterns into matrices using the division points obtained by the character frame dividing unit 3, quantifies the stroke length in each region, and obtains (a) a horizontal matrix H, (b) a vertical matrix V, (c) a left-diagonal matrix L, and (d) a right-diagonal matrix R, as shown in FIG. 9.

[0005] The matrix of 4 × (M × N) elements obtained by merging these is called a feature matrix, because it represents features that can identify which character (category) the pattern is. The feature matrix expresses category-specific properties such as the direction, position, and length of the strokes that make up the character pattern, but it is transformed so that character size, line width, variations in character shape, and the like do not affect its values.

[0006] A character matching and identification unit 9 matches the feature matrix obtained from the input character pattern against the feature matrices stored in a matrix dictionary 10, which were created in advance from standard character patterns by the same procedure for each category (character) of the entire character set to be read, and selects the category with the most similar matrix.

[0007] To do so, the distance between feature matrices is evaluated as the sum of squared differences between corresponding elements (or its square root); the one or several categories whose distance is closest to 0 are selected as candidate characters, and their character codes are output from an output terminal 11.
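
The distance-based matching described for the conventional method can be sketched as follows. This is only an illustration under stated assumptions: the feature matrices are flattened into plain lists, and the names `squared_distance`, `nearest_categories`, and `toy_dictionary`, as well as the numeric values, are hypothetical and do not come from the cited publication.

```python
def squared_distance(a, b):
    """Sum of squared differences between corresponding elements."""
    if len(a) != len(b):
        raise ValueError("feature matrices must have the same number of elements")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_categories(features, dictionary, k=1):
    """Return the k categories whose stored matrices lie closest to `features`."""
    ranked = sorted(dictionary, key=lambda cat: squared_distance(features, dictionary[cat]))
    return ranked[:k]


# Hypothetical flattened feature matrices, four elements each.
toy_dictionary = {"a": [3, 0, 1, 2], "b": [0, 4, 1, 0], "c": [3, 1, 1, 1]}
candidates = nearest_categories([3, 0, 1, 2], toy_dictionary, k=2)
```

Every input character is compared against every stored category, which is the per-character cost that the next paragraphs identify as the problem.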

[0008]

[Problems to Be Solved by the Invention] In the above character recognition method, however, input character patterns are entered one character at a time, and a feature matrix is computed for each and matched against all the character patterns stored in the dictionary. Moreover, to recognize a single European-language word, this character recognition process must be repeated as many times as the word has characters. Word recognition therefore took a long time.

[0009] An object of the present invention is to provide a recognition method that can shorten word recognition time.

[0010]

[Means for Solving the Problems] To achieve this object, the present invention is characterized by: dividing a word pattern whose constituent elements are printed European characters into three regions, an upper pattern, a middle pattern, and a lower pattern, by two horizontal dividing lines; detecting, from the horizontal peripheral distribution and the local vertical peripheral distribution of the upper pattern, all character upper patterns of the European characters while classifying them by shape into a plurality of types, and obtaining a word upper pattern code in which the number of detected upper patterns and the codes denoting their types are arranged in order of detection position; detecting, from the horizontal peripheral distribution and the local vertical peripheral distribution of the lower pattern, all character lower patterns of the European characters while classifying them by shape into a plurality of types, and obtaining a word lower pattern code in which the number of detected lower patterns and the codes denoting their types are arranged in order of detection position; creating in advance a word recognition dictionary in which the words to be recognized are classified using the word upper pattern code and the word lower pattern code; and obtaining the upper and lower pattern codes of a word from an input word pattern and indexing the word recognition dictionary with both pattern codes to select one or more word candidates.

[0011]

BEST MODE FOR CARRYING OUT THE INVENTION

[Description of the Embodiment] [Description of the Configuration] The word recognition method of the present invention does not recognize text character by character as conventional methods do; it recognizes a word directly from features contained in the pattern of the whole word. FIG. 1 is an explanatory diagram of a word recognition method according to one embodiment of the present invention, showing how the upper pattern of the word "information" is encoded, together with related data.

[0012] In this embodiment, as shown in FIG. 1(a), the rectangular frame enclosing the word pattern is divided by two horizontal lines into three rectangular regions: an upper (head) pattern, a middle (body) pattern, and a lower (bottom) pattern. The rectangular regions share a common width W, and the division gives them heights H1, H2, and H3, respectively. Of these, the middle pattern is filled with the patterns of the word's constituent characters, whereas the upper pattern is sparse, because in this example only the four characters "i", "f", "t", and "i" contribute to it. Because the word contains no character with a lower pattern, the entire lower pattern of this English word is empty.

[0013] This embodiment targets European-language documents that are printed clearly and neatly, and assumes that words are recognized after each line has been cut out of the document image and each word has been cut out of its line. At this word recognition stage, it is therefore assumed that the font size is known, that the word pattern is not skewed relative to the horizontal and vertical directions, and that the dividing lines can correctly divide it into the three parts.

[0014] To determine the heights of the dividing lines, one can take, for example, the vertical peripheral distribution of an image containing one line or one word of European text, that is, the distribution showing how many black pixels lie on each horizontal line. Because a word has most of its black pixels in its middle pattern, this distribution is close to rectangular, and the upper and lower dividing lines are obtained from the two edges of that rectangular distribution.
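
One possible way to locate the two dividing lines from the per-row black-pixel counts is sketched below. The description does not say how the edges of the near-rectangular distribution are found; the thresholding at a fraction of the peak count (`ratio`) is an assumption made for illustration, and the names `row_counts`, `find_dividing_lines`, and `toy_image` are hypothetical.

```python
def row_counts(image):
    """Black-pixel count on each horizontal line (image: list of rows of 0/1)."""
    return [sum(row) for row in image]


def find_dividing_lines(image, ratio=0.5):
    """Return (top, bottom): first and last row index of the dense middle band.

    Assumption: a row belongs to the middle pattern when its count reaches
    `ratio` times the largest row count.
    """
    counts = row_counts(image)
    threshold = ratio * max(counts)
    dense = [y for y, count in enumerate(counts) if count >= threshold]
    return dense[0], dense[-1]


# Toy binary image: two sparse rows above, a dense three-row band, two sparse rows below.
toy_image = [
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
top, bottom = find_dividing_lines(toy_image)
```

In this toy image the rows above `top` would form the upper pattern and the rows below `bottom` the lower pattern.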

[0015] FIG. 2(a) classifies the nine lowercase letters, out of the 26 in the alphabet, that have an upper pattern into four types according to the shape of the upper pattern, and assigns the values 0 to 3 as codes denoting the types. That is, the upper pattern code 0 is assigned to the characters "i" and "j", 1 to "b", "d", "h", "k", and "l", 2 to "t", and 3 to "f". FIG. 2(b) likewise classifies the five letters that have a lower pattern into three types according to the shape of the lower pattern, and assigns 0, 1, and 2 as character lower pattern codes. That is, the lower pattern code 0 is assigned to "j" and "y", 1 to "p" and "q", and 2 to "g".

[0016] The upper pattern code and lower pattern code of an English word are then defined by the variable-length sequence of the pattern codes of these characters as they appear in the word. The upper pattern code of the English word "information" is thus written 0320, using the character upper pattern codes of "i", "f", "t", and "i". Prefixing this with 4, the number of character upper patterns detected, gives 40320 as the English word upper pattern code. The lower pattern code in this example consists only of 0, the number of character lower patterns detected.
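
The encoding can be reproduced from the spelling of a word, which is how dictionary entries could be prepared. The sketch below is a minimal illustration: the two tables follow the assignments given for FIG. 2, while the function names `pattern_code` and `word_pattern_codes` and the string representation of the codes are assumptions.

```python
# Character pattern codes as assigned in FIG. 2(a) and FIG. 2(b).
UPPER_CODE = {"i": 0, "j": 0, "b": 1, "d": 1, "h": 1, "k": 1, "l": 1, "t": 2, "f": 3}
LOWER_CODE = {"j": 0, "y": 0, "p": 1, "q": 1, "g": 2}


def pattern_code(word, table):
    """Count of matching characters followed by their type codes, left to right."""
    digits = [str(table[ch]) for ch in word if ch in table]
    return str(len(digits)) + "".join(digits)


def word_pattern_codes(word):
    """Return (upper pattern code, lower pattern code) for a lowercase word."""
    return pattern_code(word, UPPER_CODE), pattern_code(word, LOWER_CODE)


upper, lower = word_pattern_codes("information")   # "40320" and "0"
```

Writing the count as a decimal prefix is unambiguous only while a word has fewer than ten such characters; a real implementation would likely store the count and the type codes as separate fields.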

[0017] In FIG. 1(a), the horizontal positions of the four character upper patterns are denoted p1, p2, p3, and p4, measured as distances from the left edge of the English word. These values can vary with the font size and with variations in character spacing, but the approximate range of each relative position within the word can be bounded.

[0018] FIG. 1(b) tabulates the features of the English word "information" described above, centered on its upper pattern, as data usable for English word recognition processing. Such data is created for every English word to be recognized and is used to build the English word recognition dictionary.

[0019] The English word recognition dictionary of this embodiment is indexed by the upper and lower pattern codes of an English word to obtain one or more English word candidates. Based on the contents of the dictionary entry, the candidates are then discriminated by the detection positions of their upper and lower patterns, or identified by pattern matching of some of their constituent characters; the English word is determined from the result, and finally its identification code or constituent character string is obtained.

[0020] The latter pattern matching involves character-by-character recognition by the conventional method; it is used when there are few upper and lower patterns and many English word candidates, or when no matching word is found.

[0021] [Description of the Operation] FIG. 3 is a flowchart of English word recognition using only the upper pattern of an English word. The lower pattern must be processed as well, but because it is processed in the same way as the upper pattern, its flowchart is omitted for simplicity. The English word recognition dictionary is assumed to have been created already; what is considered here is the recognition process when the character pattern of an unknown English word is input.

[0022] In step S101, the upper pattern of the English word is cut out. In step S102, the horizontal peripheral distribution of the entire upper pattern is obtained. FIG. 4(a) shows the horizontal peripheral distribution of the upper pattern of "information" (the portion to the left of the dotted line). It indicates how many black pixels lie in the vertical direction at each horizontal position of the English word upper pattern. In this example, black pixels are present at the horizontal positions of "i", "f", "t", and "i", the characters that have an upper pattern, so the horizontal peripheral distribution of a character upper pattern appears at four places.

[0023] In step S103, the horizontal peripheral distribution is scanned from left to right, and each continuous non-zero interval is detected as a character upper pattern. For each character upper pattern, its local vertical peripheral distribution is obtained in step S104 by counting how many black pixels lie in the horizontal direction at each vertical position of that one character upper pattern. FIG. 4(a) shows the four local vertical peripheral distributions obtained in this way.
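
Steps S102 to S104 can be sketched on a binary upper-pattern region as follows. This is a minimal illustration, assuming the region is a list of rows of 0/1 values with row 0 at the top; the function names and the toy region are hypothetical.

```python
def column_counts(region):
    """Horizontal peripheral distribution: black pixels in each column (S102)."""
    return [sum(column) for column in zip(*region)]


def nonzero_runs(profile):
    """Continuous non-zero intervals, left to right, as (left, right) pairs (S103)."""
    runs, start = [], None
    for x, value in enumerate(profile):
        if value and start is None:
            start = x
        elif not value and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


def local_row_counts(region, left, right):
    """Local vertical peripheral distribution of one character upper pattern (S104)."""
    return [sum(row[left:right + 1]) for row in region]


# Toy upper region: a dot at column 1 and a hooked stem at columns 4 to 6.
toy_upper = [
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
]
profile = column_counts(toy_upper)
runs = nonzero_runs(profile)
local_profiles = [local_row_counts(toy_upper, left, right) for left, right in runs]
```

The number of runs gives the count that prefixes the word upper pattern code, and the left edge or midpoint of each run could serve as its detection position.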

[0024] In step S105, the type (code) of each character upper pattern is determined from the shapes of its horizontal peripheral distribution and local vertical peripheral distribution, and the detection position of the character upper pattern (the horizontal coordinate of a representative point) is obtained as well. As FIG. 2 shows, there are only four types of character upper pattern, so they are easy to tell apart. The right side of the dotted line in FIG. 4(a) shows both peripheral distributions of the upper pattern of the character "b".

[0025] A typical way of discriminating upper patterns is by whether the horizontal peripheral distribution reaches the neighborhood of its maximum possible value H1: if it does not, the pattern code is "0" or "2"; if it does, the pattern code is "1" or "3". If the vertical peripheral distribution has a zero interval (an interval with no black pixels) near the bottom, the pattern code is "0". Between pattern codes "1" and "3", the code is "3" if the horizontal position at which the horizontal peripheral distribution takes its maximum is near the left end of the non-zero interval.
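
The decision rules can be sketched as a small classifier. The thresholds below (`tall_ratio`, `left_ratio`, and the "bottom third" window) are illustrative assumptions, as is the use of the center of the peak plateau to decide whether the maximum lies near the left end; the description gives only the qualitative rules. Profiles are plain lists, with the local vertical profile listed from top to bottom.

```python
def classify_upper(col_profile, row_profile, h1, tall_ratio=0.8, left_ratio=0.25):
    """Return the upper pattern code (0 to 3) of one character upper pattern.

    col_profile: horizontal peripheral distribution over the pattern's columns.
    row_profile: local vertical peripheral distribution, top row first.
    h1:          height of the upper region in pixels.
    """
    peak = max(col_profile)
    if peak < tall_ratio * h1:
        # Does not reach H1: code 0 if there is a gap near the bottom, else 2.
        bottom = row_profile[-max(1, len(row_profile) // 3):]
        return 0 if 0 in bottom else 2
    # Reaches H1: code 3 if the peak sits near the left end of the run, else 1.
    width = len(col_profile)
    peak_columns = [x for x, value in enumerate(col_profile) if value == peak]
    center = sum(peak_columns) / len(peak_columns)
    relative = center / (width - 1) if width > 1 else 0.5
    return 3 if relative <= left_ratio else 1
```

For example, with `h1 = 8` a detached dot such as `classify_upper([2, 2], [0, 0, 0, 2, 2, 0, 0, 0], 8)` falls in the "0" class, while a full-height stem of uniform width falls in the "1" class.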

[0026] The lowercase letter "f" is sometimes printed overlapping the character to its right. In that case, the character pattern of the "f" at the left end is treated as known and removed by turning its black pixels white, and both peripheral distributions of the remaining character upper pattern are computed again for discrimination.

[0027] In step S106, the English word recognition dictionary is indexed with the English word upper pattern code, which combines the above character upper pattern codes with their count, and one or more candidate English words are selected. In practice, the lower pattern code is obtained by the same procedure as S101 to S105, and in S106 both pattern codes are used to narrow down the English words; as noted above, however, the description of the lower pattern is omitted. FIG. 4(b) shows the three types of lower pattern contained in the English word "jumping", together with their peripheral distributions. Lower patterns are discriminated using, for example, the pattern width and the rise-and-fall information of the horizontal distribution.

[0028] FIG. 5 shows how the English word recognition dictionary is indexed. The full set of English words is classified as follows: words with no character upper pattern form 1 class in total, words with exactly one form 4 classes, words with two form 16 classes, words with three form 64 classes, words with four form 256 classes, and words with five form 1024 classes. English words with six or more character upper patterns are classified into 4096 classes using only the first six.
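
The class counts are powers of four, one factor per character upper pattern, capped at six patterns. A minimal sketch of such an index follows, using only the upper pattern as in FIG. 5; the tuple-based key and the names `class_count`, `index_key`, and `build_index` are assumptions made for illustration, and the total of 5461 classes is simply the sum of the counts listed above.

```python
from collections import defaultdict

N_TYPES = 4        # upper pattern codes 0 to 3
MAX_PATTERNS = 6   # only the first six upper patterns are used for indexing


def class_count(n_patterns):
    """Number of classes for words with the given number of upper patterns."""
    return N_TYPES ** min(n_patterns, MAX_PATTERNS)


def index_key(type_codes):
    """Index key: the sequence of type codes, truncated to the first six."""
    return tuple(type_codes[:MAX_PATTERNS])


def build_index(entries):
    """Group (word, type_codes) pairs by index key."""
    index = defaultdict(list)
    for word, type_codes in entries:
        index[index_key(type_codes)].append(word)
    return dict(index)


counts = [class_count(n) for n in range(7)]   # classes for 0, 1, ..., 6 or more patterns
entries = [
    ("a", []),
    ("ace", []),
    ("information", [0, 3, 2, 0]),
    ("infection", [0, 3, 2, 0]),
    ("infarction", [0, 3, 2, 0]),
]
index = build_index(entries)
```

Here the key `(0, 3, 2, 0)` corresponds to the word upper pattern code 40320, since the length of the tuple carries the count.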

[0029] The English words with no character upper pattern form a very large set, beginning with "a" and "ace". The English word "information" lies in the class whose upper pattern code is "40320", as one of three English word candidates in total: "infarction", "infection", and "information".

[0030] Next, FIG. 6 shows methods of identifying among these three candidate English words. FIG. 6(a) shows a method that identifies by the detection position of a specific upper pattern, and FIG. 6(b) shows a method that recognizes just one constituent character and discriminates by the result.

[0031] In the identification method of FIG. 6(a), the dictionary specifies the following procedure. The ratio of the detection position p2 of the second character upper pattern to the width W is computed, and:

  if 3/16 < p2/W < 5/16, the word is "infarction";
  if 2/8 < p2/W < 3/8, the word is "infection";
  if 1/8 < p2/W < 2/8, the word is "information".

[0032] This attempts to discriminate by the relative position of the character "f" within the English word. However, the three tolerance ranges overlap, so in this example the candidates cannot be fully identified.
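
The overlap is easy to see by evaluating the three ranges for a few ratios. The sketch below uses the bounds given for FIG. 6(a); the function name `candidates_by_position` and the sample pixel values are hypothetical, and exact fractions are used only to avoid rounding at the range boundaries.

```python
from fractions import Fraction

# Open intervals for p2/W, as specified for FIG. 6(a).
POSITION_RANGES = {
    "infarction": (Fraction(3, 16), Fraction(5, 16)),
    "infection": (Fraction(2, 8), Fraction(3, 8)),
    "information": (Fraction(1, 8), Fraction(2, 8)),
}


def candidates_by_position(p2, width):
    """Words whose range contains p2/width (p2 and width in whole pixels)."""
    ratio = Fraction(p2, width)
    return [word for word, (low, high) in POSITION_RANGES.items() if low < ratio < high]


unique = candidates_by_position(5, 32)      # ratio 5/32: only one range matches
ambiguous = candidates_by_position(9, 32)   # ratio 9/32: two ranges match
```

A ratio that falls where two ranges overlap leaves more than one candidate, which is why a second identification method is provided.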

[0033] In the identification method of FIG. 6(b), the dictionary specifies the following procedure. The character following the second character upper pattern is recognized, and:

  if it is "o", the word is "information";
  if it is "e", the word is "infection";
  if it is "a", the word is "infarction".

[0034] Because the candidates cannot be discriminated by the position of "f", the character following "f" is recognized by the conventional method and the word is discriminated by that character; in this example the candidates are fully identified.

[0035] Returning to the flowchart of FIG. 3, in steps S107 and S108, for every English word candidate in the selected class, the detection positions of the upper patterns of the input English word or a character recognition result are obtained according to the identification method specified there, and checked against the conditions recorded in the dictionary. In step S109, the target English word is determined from the result.
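
Steps S107 to S109 can be sketched for the character-based conditions of FIG. 6(b). The entry layout `(word, pattern_index, expected_char)`, the function `decide_word`, and the stub recognizer are all hypothetical; the per-character recognizer itself is outside the scope of the sketch and is passed in as a callable.

```python
def decide_word(class_entries, recognize_char_after):
    """Pick the single candidate whose recorded condition matches the input.

    class_entries:        list of (word, pattern_index, expected_char).
    recognize_char_after: callable returning the recognized character that
                          follows the given character upper pattern.
    Returns None when zero or several candidates match.
    """
    recognized = {}   # each required character is recognized only once
    matches = []
    for word, pattern_index, expected in class_entries:
        if pattern_index not in recognized:
            recognized[pattern_index] = recognize_char_after(pattern_index)
        if recognized[pattern_index] == expected:
            matches.append(word)
    return matches[0] if len(matches) == 1 else None


# Conditions for the class with upper pattern code 40320.
CLASS_40320 = [("information", 2, "o"), ("infection", 2, "e"), ("infarction", 2, "a")]

calls = []


def stub_recognizer(pattern_index):
    """Stand-in for a conventional single-character recognizer."""
    calls.append(pattern_index)
    return "e"


result = decide_word(CLASS_40320, stub_recognizer)
```

With the stub returning "e", the recognizer is invoked once and the word is resolved without recognizing any other character.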

[0036] [Description of Usage] Because the character-level recognition method can be chosen freely, the method of the present invention can be integrated into an existing character recognition system. Because the present invention speeds up English text recognition, it can be used to build databases of books, or connected to text-to-speech conversion systems or communication systems.

[0037]

[Effects of the Invention] As described above, in the English word recognition method of the present invention, a word pattern whose constituent elements are printed European characters is divided into three regions, an upper pattern, a middle pattern, and a lower pattern, by two horizontal dividing lines. From the horizontal peripheral distribution and the local vertical peripheral distribution of the upper pattern, all character upper patterns of the European characters are detected while being classified by shape into a plurality of types, and a word upper pattern code is obtained in which the number of detected upper patterns and the codes denoting their types are arranged in order of detection position. A word lower pattern code is obtained from the lower pattern in the same way. A word recognition dictionary in which the words to be recognized are classified using the word upper pattern code and the word lower pattern code is created in advance, the upper and lower pattern codes of a word are obtained from an input word pattern, and the word recognition dictionary is indexed with both pattern codes to select one or more word candidates. Because the upper and lower patterns, the portions that protrude above and below the body of an English word, are detected in order to recognize the English word as a whole, the step of cutting out and recognizing characters individually, as in the conventional method, can be omitted.

[0038] Furthermore, the upper and lower patterns are obtained easily and uniquely by analyzing the horizontal peripheral distribution and the local vertical peripheral distribution, and the entire English word recognition dictionary is classified without overlap by these pattern codes, so English word candidates can be narrowed down by index lookup rather than by pattern matching.

[0039] In addition, when there are several English word candidates, they are discriminated using data already obtained during feature extraction, such as the detection positions of the upper and lower patterns, so only minimal processing is needed. Even when character recognition is required, only a small number of specific characters need be recognized rather than all of them. When there are many English word candidates, a finer classification can also be made using the recognition results for specific characters or other information.

[0040] For these reasons, the present invention makes recognition processing considerably more efficient and faster than the conventional method.

[0041] Note that, with slight changes to the method of discriminating the upper and lower character patterns, the method of the present invention can be applied to most typefaces. Italic (slanted) type, for example, can be handled by slant correction and by ignoring the part of "f" that protrudes into the lower region. In that case, however, for typefaces in which "t" protrudes only slightly into the upper region, the upper pattern must be cut out accurately.

[0042] Because the upper and lower pattern codes are an easily understood representation that matches human visual perception, the causes of misreadings are easy to analyze and the recognition rate is easy to improve.

[0043] Moreover, the upper and lower pattern codes fit within about 20 bits even when combined with the identification code of the candidate characters, so they are fully practical as variable-length or fixed-length codes for English words and can be used to represent English words in English texts, grammar dictionaries, idiom dictionaries, and the like.

[Brief description of the drawings]

FIG. 1 is an explanatory diagram of the English word recognition method of the present invention.

FIG. 2 is an explanatory diagram of the encoding of the upper and lower patterns of characters.

FIG. 3 is a flowchart of English word recognition using the upper pattern.

FIG. 4 is an explanatory diagram of the peripheral distributions of upper and lower patterns.

FIG. 5 is an explanatory diagram of the dictionary indexing method.

FIG. 6 is an explanatory diagram of the method of identifying among English word candidates.

FIG. 7 is a block diagram of a conventional character recognition apparatus.

FIG. 8 is a diagram showing an example of sub-patterns.

FIG. 9 is a diagram showing an example of a feature matrix.

[Explanation of symbols]

W: width of the rectangular region of one word
P1, P2, P3, P4: horizontal positions of the upper patterns
H1: height of the upper pattern
H2: height of the middle pattern
H3: height of the lower pattern

Claims

[Claims]

1. A word recognition method characterized by: dividing a word pattern whose constituent elements are printed European characters into three regions, an upper pattern, a middle pattern, and a lower pattern, by two horizontal dividing lines; detecting, from the horizontal peripheral distribution and the local vertical peripheral distribution of the upper pattern, all character upper patterns of the European characters while classifying them by shape into a plurality of types, and obtaining a word upper pattern code in which the number of detected upper patterns and the codes denoting their types are arranged in order of detection position; detecting, from the horizontal peripheral distribution and the local vertical peripheral distribution of the lower pattern, all character lower patterns of the European characters while classifying them by shape into a plurality of types, and obtaining a word lower pattern code in which the number of detected lower patterns and the codes denoting their types are arranged in order of detection position; creating in advance a word recognition dictionary in which the words to be recognized are classified using the word upper pattern code and the word lower pattern code; and obtaining the upper and lower pattern codes of a word from an input word pattern and indexing the word recognition dictionary with both pattern codes to select one or more word candidates.

2. An English word recognition method characterized by: dividing an English word pattern composed of the 26 printed lowercase letters "a" to "z" into three regions, an upper pattern, a middle pattern, and a lower pattern, by two horizontal dividing lines; detecting, from the horizontal peripheral distribution and the local vertical peripheral distribution of the upper pattern, all character upper patterns, classified into at least four types, namely the upper parts of "i" and "j", the upper parts of "b", "d", "h", "k", and "l", the upper part of "t", and the upper part of "f", and obtaining an English word upper pattern code in which the number of detected upper patterns and the codes denoting their types are arranged in order of detection position; detecting, from the horizontal peripheral distribution and the local vertical peripheral distribution of the lower pattern, all character lower patterns, classified into at least three types, namely the lower parts of "j" and "y", the lower parts of "p" and "q", and the lower part of "g", and obtaining an English word lower pattern code in which the number of detected lower patterns and the codes denoting their types are arranged in order of detection position; creating in advance an English word recognition dictionary in which the English words to be recognized are classified using the English word upper pattern code and the English word lower pattern code; and obtaining the upper and lower pattern codes of an English word from an input English word pattern and indexing the English word recognition dictionary with both pattern codes to select one or more English word candidates.

3. The English word recognition method according to claim 2, wherein an identification condition on the detection positions of the character upper patterns and character lower patterns, or an identification condition on the constituent characters of an English word, is recorded in advance in the English word recognition dictionary as identification data for English word candidates, and an English word that satisfies the identification condition on the detection positions or the identification condition on the constituent characters is selected from among the English word candidates selected for the input English word pattern.