JPS62133585A

JPS62133585A - Word segmenting system

Info

Publication number: JPS62133585A
Application number: JP60274051A
Authority: JP
Inventors: Koichi Ejiri; 公一江尻
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1985-12-05
Filing date: 1985-12-05
Publication date: 1987-06-16

Abstract

PURPOSE: To accurately segment words from image information read from an original in a European language or the like, by judging a break in the projection whose width is at or above a decision threshold to be a gap between words. CONSTITUTION: A vertical projection extracting circuit 20 derives the vertical projection of text-line image information input from a line buffer 16. A gap histogram generating circuit 22 receives the vertical projection of one text line from circuit 20 and generates a gap histogram by counting how often each break width occurs in the projection. The histogram is input to a decision threshold determining circuit 24, which detects the largest width corresponding to a peak and sends it to a word segmenting circuit 26. Using the vertical projection input from circuit 20, circuit 26 identifies each break whose width is at or above the word gap decision threshold as an inter-word gap, and segments from the text-line image information the image of each word delimited by inter-word gaps.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Technical Field]

The present invention relates to a word extraction method for extracting words from scanned image information of a European-language document or the like.

[Prior art]

In document processing apparatuses that handle European-language text, processing is often required not only at the level of individual characters but also at the level of words.

For example, an OCR system for European-language text may not only recognize individual characters but also perform word-level processing, such as correcting or preventing recognition errors by comparing each word (a group of recognized characters) against a word dictionary, or estimating characters that could not be recognized. In that case, it is necessary not only to extract text lines and then characters from the scanned image information of the document, but also to extract words.

Similar word extraction processing is indispensable in computer translation systems that translate European-language text into Japanese.

Conventionally, such word extraction takes the vertical projection of the image information of a text line extracted from the scanned image, compares the width of each break in the projection with a fixed, preset inter-word gap judgment threshold, regards each break whose width is at least the threshold as a word boundary, and extracts words from the text line image information.
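The conventional fixed-threshold procedure can be sketched as follows. This is a minimal illustration, not code from the patent: the function names, the representation of a text line as a list of rows of 0/1 pixels, and the toy image are all assumptions.

```python
def vertical_projection(line_image):
    """Count black pixels (1s) in each column of a binary text-line image."""
    return [sum(column) for column in zip(*line_image)]


def split_words_fixed(line_image, fixed_threshold):
    """Return (start, end) column ranges of words, end exclusive.

    A run of empty columns at least `fixed_threshold` wide is treated as a
    word boundary; narrower runs are treated as inter-character gaps.
    """
    projection = vertical_projection(line_image)
    words = []
    start = None      # first column of the word being built
    last_ink = None   # most recent column that contains ink
    for x, count in enumerate(projection):
        if count == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= fixed_threshold:
            words.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words


# One-row toy image: three 2-column characters separated by a 1-column gap
# and a 3-column gap.
row = [1, 1, 0, 1, 1, 0, 0, 0, 1, 1]
print(split_words_fixed([row], fixed_threshold=3))  # [(0, 5), (8, 10)]
```

The threshold here is chosen once and applied to every line, which is the property the following paragraphs criticize.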

However, European-language magazines and the like are generally printed with proportional pitch, so the inter-character and inter-word gap widths are not constant. With a fixed threshold, inter-word gaps are easily misjudged, and word extraction errors result.

Moreover, even for documents that are not printed with proportional pitch, a method using a fixed judgment threshold can produce frequent word extraction errors. One example is a European-language document in which characters of different sizes are mixed.

That is, the inter-character gap in text printed at a large character size can be wider than the inter-word gap in text printed at a small character size. With a fixed judgment threshold as in the conventional method, inter-word gaps cannot be judged correctly, and the probability of word extraction errors is high.
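A small numeric illustration of this failure mode follows. The gap widths are invented for the example (they are not measurements from the patent); each hypothetical line contains exactly two true inter-word gaps.

```python
def count_word_boundaries(gap_widths, threshold):
    """Number of gaps that a threshold-based splitter treats as word boundaries."""
    return sum(1 for width in gap_widths if width >= threshold)


# Hypothetical gap widths in pixels, in left-to-right order.
large_type_line = [6, 6, 14, 6, 6, 6, 14, 6]   # character gaps 6, word gaps 14
small_type_line = [2, 2, 5, 2, 2, 2, 5, 2]     # character gaps 2, word gaps 5

for threshold in (4, 10):
    print(threshold,
          count_word_boundaries(large_type_line, threshold),
          count_word_boundaries(small_type_line, threshold))
# 4 8 2    -> every character gap in the large type is mistaken for a word gap
# 10 2 0   -> the small type is never split into words
```

With these numbers, a threshold that works for one line has to lie between 3 and 5 for the small type and between 7 and 14 for the large type, so no single value serves both.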

[Purpose]

An object of the present invention is to provide a word extraction method that reliably extracts words from image information read from documents such as European-language documents printed with proportional pitch or European-language documents in which characters of different sizes are mixed.

[Constitution]

To achieve this object, the word extraction method of the present invention takes the vertical projection of the image information of a text line, creates a histogram of the widths of the breaks in that projection, determines an inter-word gap judgment threshold according to the largest width corresponding to a peak of the histogram, judges each break in the projection whose width is at least that threshold to be an inter-word gap, and extracts words from the image information of the text line.

[Embodiment]

An embodiment of the present invention will be described below with reference to the drawings.

FIG. 1 is a schematic block diagram showing the main configuration of a document processing apparatus to which the word extraction method of the present invention is applied. In the figure, reference numeral 10 denotes a scanner for reading a document (for example, an English document), and the image information of the document read by the scanner 10 is stored in an image buffer 12. The image information is input sequentially from the image buffer 12 to a line extraction unit 14, which extracts the image information of each text line.

This line extraction is performed, for example, by the common projection method, in which a horizontal (line-direction) projection is computed and the region between adjacent valleys of the projection is cut out. Line extraction may of course instead use an improved projection method that takes a projection for each partial region. The extracted text line image information is temporarily stored in a line buffer 16 and then input to a word extraction unit 18 in the subsequent stage.
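The basic projection method for line extraction might look like the sketch below. It assumes a clean binary page (a list of rows of 0/1 pixels) and treats a "valley" as a row with no ink at all; real scans would likely need a noise tolerance. All names are illustrative.

```python
def horizontal_projection(page_image):
    """Count black pixels (1s) in each row of a binary page image."""
    return [sum(row) for row in page_image]


def extract_line_ranges(page_image):
    """Return (top, bottom) row ranges of text lines, bottom exclusive."""
    ranges = []
    top = None
    for y, count in enumerate(horizontal_projection(page_image)):
        if count > 0 and top is None:
            top = y                      # entering a text line
        elif count == 0 and top is not None:
            ranges.append((top, y))      # leaving a text line
            top = None
    if top is not None:
        ranges.append((top, len(page_image)))
    return ranges


page = [
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(extract_line_ranges(page))  # [(1, 3), (4, 5)]
```

Each returned range can then be sliced out of the page (`page[top:bottom]`) and passed on as one text line image.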

The word extraction unit 18 consists of a vertical projection extraction circuit 20, a gap histogram creation circuit 22, an inter-word gap judgment threshold determination circuit 24, and a word extraction circuit 26.

The vertical projection extraction circuit 20 computes the projection, in the vertical direction (perpendicular to the text line), of the text line image information input from the line buffer 16. Continuous portions of this projection correspond to the extents of characters, and breaks in the projection (valleys) are either gaps between characters or gaps between words.
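The vertical projection and its breaks can be sketched as follows. The image representation (rows of 0/1 pixels) and the function names are assumptions, and blank margins at either end of the line are deliberately not reported as gaps.

```python
def vertical_projection(line_image):
    """Count black pixels (1s) in each column of a binary text-line image."""
    return [sum(column) for column in zip(*line_image)]


def find_gaps(projection):
    """Return (start, width) for each run of empty columns between inked columns.

    Leading and trailing blank margins are not counted as gaps.
    """
    gaps = []
    last_ink = None
    for x, count in enumerate(projection):
        if count == 0:
            continue
        if last_ink is not None and x - last_ink > 1:
            gaps.append((last_ink + 1, x - last_ink - 1))
        last_ink = x
    return gaps


line = [
    [0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
]
projection = vertical_projection(line)
print(projection)             # [0, 2, 1, 0, 2, 0, 0, 0, 2, 1, 0]
print(find_gaps(projection))  # [(3, 1), (5, 3)]
```

At this stage nothing distinguishes the 1-column gap from the 3-column gap; that decision is left to the threshold chosen later.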

As noted above, in European-language documents printed with proportional pitch and the like, the inter-character and inter-word gaps vary widely, so the judgment threshold for identifying inter-word gaps must be set appropriately for each text line. The gap histogram creation circuit 22 and the inter-word gap judgment threshold determination circuit 24 serve this purpose.

The gap histogram creation circuit 22 receives the vertical projection information for one text line from the vertical projection extraction circuit 20, counts how often each width of break (gap) occurs in the vertical projection, and creates a gap histogram. For example, suppose image information of a European-language document such as that shown in FIG. 2 is input. In this figure, L1, L2, and L3 each denote a text line, each hatched range denotes a word, and the blank spaces between words denote inter-word gaps.

Now, when the image information of text line L2 is input to the word extraction unit 18, a gap histogram such as that shown in FIG. 3 is obtained. Range A in this gap histogram corresponds to inter-character gaps, and range B corresponds to inter-word gaps.
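Building the gap histogram amounts to counting how many times each gap width occurs in one text line. A minimal sketch, with invented gap widths standing in for the data behind FIG. 3:

```python
from collections import Counter


def gap_histogram(gap_widths):
    """Map each gap width to the number of times it occurs in one text line."""
    return Counter(gap_widths)


# Hypothetical gap widths (pixels) measured along one text line.
widths = [2, 3, 2, 9, 2, 3, 8, 9, 3, 2, 9]
hist = gap_histogram(widths)
for width in sorted(hist):
    print(f"{width:2d} {'#' * hist[width]}")
```

In this made-up data the narrow widths (2 and 3) play the role of range A and the wide ones (8 and 9) the role of range B.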

The inter-word gap judgment threshold determination circuit 24 receives the gap histogram information and detects the largest width (Gm) that corresponds to a peak of the gap histogram. Only peaks whose frequency value is at least a predetermined value (for example, 2) are considered.

For example, in the case of the gap histogram shown in FIG. 3, the gap width corresponding to the peak in range B is detected as Gm.
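One way to detect Gm is sketched below. The text does not define "peak" precisely, so this sketch assumes a peak is a width whose frequency is at least that of both neighbouring widths and at least the minimum frequency (2 in the text's example); that definition and the function name are assumptions.

```python
def largest_peak_width(histogram, min_frequency=2):
    """Return Gm, the largest width that is a qualifying peak, or None.

    Assumed peak definition: frequency >= min_frequency and not lower than
    the frequency of the adjacent widths (missing widths count as 0).
    """
    peaks = [
        width
        for width, freq in histogram.items()
        if freq >= min_frequency
        and freq >= histogram.get(width - 1, 0)
        and freq >= histogram.get(width + 1, 0)
    ]
    return max(peaks, default=None)


hist = {2: 4, 3: 3, 8: 1, 9: 3}   # hypothetical gap histogram
print(largest_peak_width(hist))   # 9
```

Here widths 2 and 9 are both peaks; width 8 is rejected by the minimum-frequency rule, and the larger peak, 9, is taken as Gm.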

The inter-word gap judgment threshold determination circuit 24 then calculates the inter-word gap judgment threshold Gt by the formula [formula not reproduced in this text] (where Gδ is a constant satisfying 0 < Gδ < Gm), and sends the inter-word gap judgment threshold Gt to the word extraction circuit 26.
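The formula for Gt is not recoverable from the text above; only the constraint 0 < Gδ < Gm is stated. Purely as an assumption, the sketch below uses the simplest form consistent with that constraint, Gt = Gm - Gδ, which keeps Gt positive and no greater than Gm. It should not be read as the patent's confirmed formula.

```python
def inter_word_threshold(gm, g_delta):
    """Return Gt under the ASSUMED form Gt = Gm - G_delta.

    The source formula is not legible; only 0 < G_delta < Gm is given.
    """
    if not 0 < g_delta < gm:
        raise ValueError("G_delta must satisfy 0 < G_delta < Gm")
    return gm - g_delta


print(inter_word_threshold(9, 2))  # 7
```

Under this assumed form, Gδ acts as a tolerance below the detected peak width, so inter-word gaps slightly narrower than Gm are still accepted.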

From the vertical projection information input from the vertical projection extraction circuit 20, the word extraction circuit 26 identifies each break in the vertical projection whose width is at least the inter-word gap judgment threshold as an inter-word gap, and cuts out from the text line image information the image information of each word delimited by inter-word gaps.
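The cutting step can be sketched as below: given a threshold Gt, breaks at least Gt wide split the line, and each word is cropped out as its own small image. The names and the rows-of-0/1 image representation are assumptions made for the example.

```python
def word_column_ranges(projection, gt):
    """Return (start, end) column ranges of words, end exclusive."""
    ranges = []
    start = None
    last_ink = None
    for x, count in enumerate(projection):
        if count == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= gt:     # break at least Gt wide: inter-word gap
            ranges.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        ranges.append((start, last_ink + 1))
    return ranges


def crop_words(line_image, gt):
    """Cut the text-line image into one sub-image per word."""
    projection = [sum(column) for column in zip(*line_image)]
    return [
        [row[start:end] for row in line_image]
        for start, end in word_column_ranges(projection, gt)
    ]


line = [
    [1, 0, 1, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 1],
]
for word in crop_words(line, gt=3):
    print(word)
# [[1, 0, 1], [1, 0, 1]]
# [[1, 1], [0, 1]]
```

The 1-column break stays inside the first word, while the 3-column break separates the two words.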

For example, suppose the image information of two text lines whose vertical projections are as shown in FIG. 4 is input in sequence to the word extraction unit 18. In the figure, the hatched ranges are ranges where the vertical projection is continuous, and each corresponds to a string of one or more characters or to a word. With proportional pitch printing, the gap widths in the vertical projection of a text line vary as in this example.

In such text lines, the breaks in the vertical projection marked with arrows (↑) are judged to be inter-word gaps, and the words are extracted.

Because the inter-word gap judgment threshold is determined dynamically from the gap histogram of each text line's vertical projection, inter-word gaps are identified correctly and words are extracted accurately even for documents printed with proportional pitch and documents in which characters of different sizes are mixed.
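To illustrate the per-line behaviour, the sketch below applies the histogram-based threshold to two hypothetical lines set in different type sizes. The gap widths are invented, the peak definition (frequency not lower than adjacent widths) is an assumption, and so is the threshold form Gt = Gm - Gδ, since the source formula is not legible.

```python
from collections import Counter


def dynamic_threshold(gap_widths, g_delta, min_frequency=2):
    """Return a per-line threshold Gt, or None if no peak qualifies."""
    hist = Counter(gap_widths)
    peaks = [
        width
        for width, freq in hist.items()
        if freq >= min_frequency
        and freq >= hist.get(width - 1, 0)
        and freq >= hist.get(width + 1, 0)
    ]
    if not peaks:
        return None
    gm = max(peaks)          # largest width that is a peak
    return gm - g_delta      # ASSUMED formula, not confirmed by the source


# Hypothetical gap widths; the true word gaps are at indices 2 and 6.
lines = {
    "large type": [6, 6, 14, 6, 6, 6, 14, 6],
    "small type": [2, 2, 5, 2, 2, 2, 5, 2],
}
for name, gaps in lines.items():
    gt = dynamic_threshold(gaps, g_delta=2)
    boundaries = [i for i, width in enumerate(gaps) if width >= gt]
    print(name, gt, boundaries)
# large type 12 [2, 6]
# small type 3 [2, 6]
```

Each line gets its own threshold (12 and 3 here), so both are split at the correct gaps even though the large type's character gaps are wider than the small type's word gaps.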

The function of the word extraction unit 18 may also be implemented by software processing using a microprocessor or the like. The present invention can be practiced with various other modifications as well.

[Effect]

As is clear from the detailed description above, the present invention performs word extraction by dynamically determining the inter-word gap judgment threshold from the gap histogram of the vertical projection of each text line. Words can therefore be extracted accurately even from documents printed with proportional pitch and documents in which characters of different sizes are mixed.

[Brief explanation of drawings]

FIG. 1 is a schematic block diagram showing only the main configuration of a document processing apparatus to which the word extraction method of the present invention is applied. FIG. 2 shows an example of text lines for explaining the gap histogram. FIG. 3 shows an example of a gap histogram. FIG. 4 shows the vertical projections of text lines together with their inter-word gaps.

10: scanner; 12: image buffer; 14: line extraction unit; 16: line buffer; 18: word extraction unit; 20: vertical projection extraction circuit; 22: gap histogram creation circuit; 24: inter-word gap judgment threshold determination circuit; 26: word extraction circuit.

Agent: Patent Attorney Makoto Suzuki

Claims

[Claims]

(1) A word extraction method for a document processing apparatus that inputs image information of a document and performs processing including extraction of words from the document, characterized in that a vertical projection of the image information of a text line is taken, a histogram of the widths of the breaks in the projection is created, an inter-word gap judgment threshold is determined according to the largest width corresponding to a peak of the histogram, each break in the projection whose width is at least the judgment threshold is judged to be a gap between words, and words are extracted from the image information of the text line.