JPS62133585A - Word segmenting system - Google Patents
Word segmenting system
Info
- Publication number
- JPS62133585A (application JP60274051A / JP27405185A)
- Authority
- JP
- Japan
- Prior art keywords
- word
- gap
- circuit
- image information
- projection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Character Input (AREA)
Abstract
Description
[Detailed Description of the Invention]
[Technical Field]
The present invention relates to a word extraction method for cutting out words from the scanned image information of a Western-language manuscript or the like.
In document processing devices that handle Western-language text, processing is often required not only at the character level but also at the word level.
For example, an OCR system for Western-language text may not only recognize individual characters but also perform word-level processing, such as comparing each word (a group of recognized characters) with a word dictionary in order to correct or prevent recognition errors, or to infer characters that could not be recognized. In this case, it is necessary not only to extract text lines, and then characters, from the scanned image information of the manuscript, but also to extract words.
Similar word extraction processing is indispensable in computer translation systems that translate Western-language text into Japanese.
Conventionally, such word extraction takes the vertical projection of the image information of a text line extracted from the scanned image, compares the width of each break in that projection with a fixed, preset inter-word gap threshold, regards every break at least as wide as the threshold as a word boundary, and extracts the words from the text line image information.
However, Western-language magazines and similar material are generally printed with proportional pitch, so inter-character and inter-word gap widths are not constant. With a fixed threshold, inter-word gaps are easily misjudged, and as a result word extraction errors are likely.
Moreover, even for manuscripts that are not printed with proportional pitch, a method that uses a fixed threshold can cause frequent word extraction errors. One example is a manuscript in which characters of different sizes are mixed. The inter-character gap of text printed in a large size can be wider than the inter-word gap of text printed in a small size, so a fixed threshold, as used conventionally, cannot judge inter-word gaps correctly, and the probability of word extraction errors is high.
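The failure mode described above can be illustrated with a small sketch. The gap widths, the threshold value, and the function name `word_gaps_fixed` are hypothetical and chosen only for illustration; they do not come from the patent.

```python
def word_gaps_fixed(gap_widths, threshold):
    """Return the indices of gaps judged to be inter-word gaps by a fixed threshold."""
    return [i for i, width in enumerate(gap_widths) if width >= threshold]


# Hypothetical gap widths in pixels (illustrative values, not from the patent).
small_text = [1, 2, 1, 5, 1, 2, 6, 1]    # true inter-word gaps: widths 5 and 6
large_text = [6, 7, 6, 18, 7, 6, 20, 6]  # true inter-word gaps: widths 18 and 20

FIXED_THRESHOLD = 5  # assumed value, tuned for the small text only

small_result = word_gaps_fixed(small_text, FIXED_THRESHOLD)
large_result = word_gaps_fixed(large_text, FIXED_THRESHOLD)

print(small_result)  # only the two true inter-word gaps are selected
print(large_result)  # every gap is selected, so each character becomes a "word"
```

In this sketch the threshold that works for the small text marks every inter-character gap of the large text as a word boundary, which is the kind of error the fixed-threshold method is said to produce.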
An object of the present invention is to provide a word extraction method that reliably extracts words from image information read from Western-language manuscripts printed with proportional pitch, Western-language manuscripts in which characters of different sizes are mixed, and the like.
To achieve this object, the word extraction method of the present invention takes the vertical projection of the image information of a text line, creates a histogram of the widths of the breaks in that projection, determines an inter-word gap threshold according to the largest width that corresponds to a peak of the histogram, judges every break in the projection whose width is at least that threshold to be an inter-word gap, and extracts words from the image information of the text line.
An embodiment of the present invention is described below with reference to the drawings.
FIG. 1 is a schematic block diagram showing the configuration of the main part of a document processing apparatus to which the word extraction method of the present invention is applied. In the figure, 10 is a scanner for reading a document (for example, an English manuscript). The image information of the document read by the scanner 10 is stored in an image buffer 12. From the image buffer 12, the image information is input sequentially to a line extraction unit 14, which extracts the image information of each text line.
This line extraction is performed, for example, by the general projection method, in which a projection in the horizontal direction (the line direction) is obtained and the portion between one valley of the projection and the next is cut out. Line extraction may of course instead use an improved projection method that takes a projection for each partial region. The image information of each extracted text line is temporarily stored in a line buffer 16 and then input to a word extraction unit 18 in the subsequent stage.
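The general projection method for line extraction can be sketched as follows. This is a minimal illustration, assuming a binary image stored as a list of rows (1 = black pixel) and assuming that a "valley" is a row with no black pixels; the function names are hypothetical.

```python
def horizontal_projection(image):
    """Count the black pixels (1s) in each row of a binary image."""
    return [sum(row) for row in image]


def extract_lines(image):
    """Return (top, bottom) row ranges, bottom exclusive, of runs of non-empty rows."""
    lines, start = [], None
    for y, count in enumerate(horizontal_projection(image)):
        if count > 0 and start is None:
            start = y                 # first inked row of a new text line
        elif count == 0 and start is not None:
            lines.append((start, y))  # an empty row (valley) closes the line
            start = None
    if start is not None:
        lines.append((start, len(image)))
    return lines


# Hypothetical 5-row page: rows 1-2 form one text line, row 4 another.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
line_ranges = extract_lines(page)
print(line_ranges)
```

Real scans contain noise, so a practical version would likely treat rows below some small count as empty; that refinement is outside what the text describes.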
The word extraction unit 18 consists of a vertical projection extraction circuit 20, a gap histogram creation circuit 22, an inter-word gap threshold determination circuit 24, and a word extraction circuit 26.
The vertical projection extraction circuit 20 obtains the projection, in the vertical direction (the direction perpendicular to the text line), of the text line image information input from the line buffer 16. Each continuous portion of this projection corresponds to the extent of a character, and each break in the projection (a valley) is either a gap between characters or a gap between words.
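As a rough software analogue of this circuit, the sketch below computes the vertical projection of a binary text-line image and lists the breaks in it. The data layout (list of rows, 1 = black pixel) and the function names are assumptions made for illustration.

```python
def vertical_projection(line_image):
    """Count the black pixels (1s) in each column of a binary text-line image."""
    return [sum(column) for column in zip(*line_image)]


def projection_gaps(projection):
    """Return (start, width) for each run of empty columns between inked columns."""
    gaps, start, seen_ink = [], None, False
    for x, count in enumerate(projection):
        if count == 0:
            if seen_ink and start is None:
                start = x  # first empty column after some ink
        else:
            if start is not None:
                gaps.append((start, x - start))
                start = None
            seen_ink = True
    return gaps  # leading and trailing margins are not reported as gaps


# Hypothetical 2-row text line with a narrow gap and a wide gap.
line = [
    [1, 0, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 0, 1, 0, 0],
]
profile = vertical_projection(line)
gaps = projection_gaps(profile)
print(profile)
print(gaps)
```

Ignoring the margins is a design choice of this sketch: only breaks that lie between two inked regions can be inter-character or inter-word gaps.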
As described above, in Western-language manuscripts printed with proportional pitch and the like, inter-character and inter-word gaps vary widely, so the threshold used to identify inter-word gaps must be set appropriately for each text line. The gap histogram creation circuit 22 and the inter-word gap threshold determination circuit 24 serve this purpose.
The gap histogram creation circuit 22 receives the vertical projection information of one text line from the vertical projection extraction circuit 20, counts the frequency of each width of the breaks (gaps) in that projection, and creates a gap histogram. For example, suppose that the image information of a Western-language manuscript such as that shown in FIG. 2 is input. In this figure, L1, L2, and L3 each denote a text line, each hatched area denotes a word, and the blank spaces between words denote inter-word gaps. When the image information of text line L2 is input to the word extraction unit 18, a gap histogram such as that shown in FIG. 3 is obtained. Range A of this gap histogram corresponds to inter-character gaps, and range B corresponds to inter-word gaps.
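A gap histogram of this kind can be sketched in a few lines. The gap widths below are hypothetical values chosen so that two clusters appear, in the spirit of ranges A and B; they are not read from FIG. 3.

```python
from collections import Counter


def gap_histogram(gap_widths):
    """Map each gap width to the number of times it occurs in one text line."""
    return Counter(gap_widths)


# Hypothetical gap widths (pixels) for a single text line.
line_gap_widths = [2, 3, 2, 9, 2, 3, 10, 3, 2, 9, 3]
histogram = gap_histogram(line_gap_widths)

# Text rendering: narrow widths (inter-character) cluster at 2-3,
# wide widths (inter-word) cluster at 9-10.
for width in sorted(histogram):
    print(f"{width:2d} | {'#' * histogram[width]}")
```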
The inter-word gap threshold determination circuit 24 receives this gap histogram information and detects the largest width (Gm) that corresponds to a peak of the gap histogram. Only peaks whose frequency is at least a predetermined value (for example, 2) are considered.
For example, in the case of the gap histogram shown in FIG. 3, the gap width corresponding to the peak in range B is detected as Gm.
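The detection of Gm might be sketched as follows. The text does not define "peak" precisely, so this sketch assumes a peak is a width whose frequency is at least the predetermined minimum and is not lower than the frequency of either neighbouring width; the function name `find_gm` and the sample histogram are hypothetical.

```python
def find_gm(histogram, min_frequency=2):
    """Return the largest gap width that is a histogram peak, or None if there is none.

    Assumption: a peak is a width whose frequency is >= min_frequency and
    >= the frequencies of the adjacent widths (width - 1 and width + 1).
    """
    peaks = [
        width
        for width, freq in histogram.items()
        if freq >= min_frequency
        and freq >= histogram.get(width - 1, 0)
        and freq >= histogram.get(width + 1, 0)
    ]
    return max(peaks) if peaks else None


# Hypothetical histogram: widths 2-3 are inter-character, 9-10 inter-word.
histogram = {2: 4, 3: 4, 9: 2, 10: 1}
gm = find_gm(histogram)
print(gm)
```

The minimum-frequency condition keeps a single unusually wide gap (width 10 here, seen once) from being chosen as Gm.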
The inter-word gap threshold determination circuit 24 then calculates the inter-word gap threshold Gt from Gm and a constant Gδ satisfying 0 < Gδ < Gm (the formula itself is not reproduced in this text), and sends the inter-word gap threshold Gt to the word extraction circuit 26.
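Because the formula is missing here, the following is only a guess at its form. Given the stated constraint 0 < Gδ < Gm, one plausible reading is Gt = Gm − Gδ, which keeps the threshold positive and slightly below the peak width. That relation, the function name, and the sample values are assumptions, not the patent's confirmed formula.

```python
def inter_word_threshold(gm, g_delta):
    """Compute Gt under the ASSUMED rule Gt = Gm - G_delta (not confirmed by the source).

    The source only states that G_delta is a constant with 0 < G_delta < Gm.
    """
    if not 0 < g_delta < gm:
        raise ValueError("g_delta must satisfy 0 < g_delta < gm")
    return gm - g_delta


# Hypothetical values: Gm = 9 pixels, G_delta = 3 pixels.
gt = inter_word_threshold(gm=9, g_delta=3)
print(gt)
```

Under this reading, Gδ acts as a tolerance so that inter-word gaps somewhat narrower than the peak width Gm are still accepted.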
The word extraction circuit 26 examines the vertical projection information input from the vertical projection extraction circuit 20, identifies every break in the vertical projection whose width is at least the inter-word gap threshold as an inter-word gap, and extracts from the text line image information the image information of each word delimited by the inter-word gaps.
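The extraction step can be sketched as a single pass over the vertical projection. The projection values, the threshold, and the function name `extract_words` are hypothetical; the sketch returns column spans rather than cropped images.

```python
def extract_words(projection, threshold):
    """Return (start, end) column spans, end exclusive, of the words in one text line."""
    words, word_start, last_ink = [], None, None
    for x, count in enumerate(projection):
        if count == 0:
            continue  # empty column: part of some gap
        if word_start is None:
            word_start = x
        elif x - last_ink - 1 >= threshold:
            # The gap since the previous inked column is wide enough: close the word.
            words.append((word_start, last_ink + 1))
            word_start = x
        last_ink = x
    if word_start is not None:
        words.append((word_start, last_ink + 1))
    return words


# Hypothetical projection: 1 = column contains ink, 0 = empty column.
profile = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1]
words = extract_words(profile, threshold=4)
print(words)
```

Gaps narrower than the threshold (widths 1 here) stay inside a word, while the gaps of width 4 and 5 split the line into three words.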
For example, suppose that the image information of two text lines whose vertical projections are as shown in FIG. 4 is input sequentially to the word extraction unit 18. In the figure, each hatched range is a range over which the vertical projection is continuous, and corresponds to a string of one or more characters or to a word. With proportional pitch printing, the gap widths in the vertical projection of a text line vary as in this example.
In such text lines, the breaks in the vertical projection marked with an arrow (↑) are judged to be inter-word gaps, and the words are extracted.
In this way, the inter-word gap threshold is determined dynamically from the gap histogram of the vertical projection of each text line, so inter-word gaps are identified correctly and words are extracted accurately even for documents printed with proportional pitch and documents in which characters of different sizes are mixed.
The functions of the word extraction unit 18 may also be implemented in software using a microprocessor or the like. The present invention can be carried out with various other modifications as well.
As is clear from the detailed description above, the present invention determines the inter-word gap threshold dynamically from the gap histogram of the vertical projection of a text line and extracts words accordingly, so words can be extracted accurately even from documents printed with proportional pitch and documents in which characters of different sizes are mixed.
FIG. 1 is a schematic block diagram showing only the main configuration of a document processing apparatus to which the word extraction method of the present invention is applied. FIG. 2 shows an example of text lines used to explain the gap histogram. FIG. 3 shows an example of a gap histogram. FIG. 4 shows the vertical projections of text lines together with their inter-word gaps.
10: scanner; 12: image buffer; 14: line extraction unit; 16: line buffer; 18: word extraction unit; 20: vertical projection extraction circuit; 22: gap histogram creation circuit; 24: inter-word gap threshold determination circuit; 26: word extraction circuit.
Agent: Patent Attorney Makoto Suzuki
FIG. 2 (text lines L1, L2, L3)
FIG. 3 (gap histogram; axis label: frequency)
FIG. 4 (arrows ↑ mark the inter-word gaps)
Claims (1)
(1) A word extraction method for a document processing apparatus that receives image information of a document and performs processing that includes extracting words from the document, characterized in that a vertical projection of the image information of a text line is taken, a histogram of the widths of the breaks in the projection is created, an inter-word gap threshold is determined according to the largest width corresponding to a peak of the histogram, every break in the projection whose width is at least the threshold is judged to be an inter-word gap, and words are extracted from the image information of the text line.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP60274051A JPS62133585A (en) | 1985-12-05 | 1985-12-05 | Word segmenting system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP60274051A JPS62133585A (en) | 1985-12-05 | 1985-12-05 | Word segmenting system |
Publications (1)
Publication Number | Publication Date |
---|---|
JPS62133585A true JPS62133585A (en) | 1987-06-16 |
Family
ID=17536281
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP60274051A Pending JPS62133585A (en) | 1985-12-05 | 1985-12-05 | Word segmenting system |
Country Status (1)
Country | Link |
---|---|
JP (1) | JPS62133585A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS63103390A (en) * | 1986-10-20 | 1988-05-09 | Sharp Corp | Word processing system |
JPS63158678A (en) * | 1986-12-23 | 1988-07-01 | Sharp Corp | Inter-word space detecting method |
JPH02255995A (en) * | 1988-04-28 | 1990-10-16 | Seiko Epson Corp | Character segmenting method |
JPH03225576A (en) * | 1990-01-31 | 1991-10-04 | Oki Electric Ind Co Ltd | Device for segmenting word |
US5357581A (en) * | 1991-11-01 | 1994-10-18 | Eastman Kodak Company | Method and apparatus for the selective filtering of dot-matrix printed characters so as to improve optical character recognition |
US5394482A (en) * | 1991-11-01 | 1995-02-28 | Eastman Kodak Company | Method and apparatus for the detection of dot-matrix printed text so as to improve optical character recognition |
JPH07319998A (en) * | 1988-04-28 | 1995-12-08 | Seiko Epson Corp | Character cutting method |
- 1985-12-05: JP application JP60274051A filed (patent JPS62133585A, status: Pending)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5664027A (en) | Methods and apparatus for inferring orientation of lines of text | |
EP0544430B1 (en) | Method and apparatus for determining the frequency of words in a document without document image decoding | |
EP0544434A2 (en) | Method and apparatus for processing a document image | |
JP5508359B2 (en) | Character recognition device, character recognition method and program | |
US5561720A (en) | Method for extracting individual characters from raster images of a read-in handwritten or typed character sequence having a free pitch | |
JPS62133585A (en) | Word segmenting system | |
US4887301A (en) | Proportional spaced text recognition apparatus and method | |
CN113553852A (en) | Contract information extraction method, system and storage medium based on neural network | |
JP2000089786A (en) | Method and apparatus for correcting speech recognition result | |
JP3537570B2 (en) | Space detection method for Japanese-English mixed documents, pitch format determination method, and space detection method for fixed-pitch alphanumeric character strings | |
JP2915175B2 (en) | Word space detection method | |
JPS6226587A (en) | Character field free pitch processing system for optical character reader | |
JP2968354B2 (en) | Post-processing method of character recognition result | |
JPH02230484A (en) | character recognition device | |
JP2985813B2 (en) | Character string recognition device and knowledge database learning method | |
JPH04335487A (en) | Character segmenting method for character recognizing device | |
JP2746345B2 (en) | Post-processing method for character recognition | |
JP2887823B2 (en) | Document recognition device | |
JP3086264B2 (en) | Character space recognition method | |
JP2891368B2 (en) | Post-processing method of character recognition result | |
JPH10171924A (en) | Character recognition device | |
JP2851102B2 (en) | Character extraction method | |
JPH05225183A (en) | Automatic error detector for words in japanese sentence | |
JPH01277989A (en) | Character string pattern reader | |
JP3345469B2 (en) | Word spacing calculation method, word spacing calculation device, character reading method, character reading device |