JPS62133585A - Word segmenting system - Google Patents

Word segmenting system

Info

Publication number
JPS62133585A
Authority
JP
Japan
Prior art keywords
word
gap
circuit
image information
projection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP60274051A
Other languages
Japanese (ja)
Inventor
Koichi Ejiri
公一 江尻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to JP60274051A
Publication of JPS62133585A

Landscapes

  • Character Input (AREA)

Abstract

PURPOSE: To segment words accurately from image information read from an original in a European language or the like, by judging a break in the projection whose width is at least a decision threshold to be a gap between words. CONSTITUTION: A vertical projection extracting circuit 20 derives the vertical projection of text line image information input from a line buffer 16. A gap histogram generating circuit 22 receives the vertical projection of one text line from the circuit 20 and generates a gap histogram by counting the frequency of each width of break in the vertical projection. The histogram is input to a decision threshold determining circuit 24, which detects the maximum width corresponding to a peak, determines the inter-word gap decision threshold from it, and sends the threshold to a word segmenting circuit 26. From the vertical projection input from the circuit 20, the circuit 26 identifies breaks whose width is at least the inter-word gap decision threshold as inter-word gaps, and segments the image information of each word delimited by those gaps from the text line image information.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Technical Field]

The present invention relates to a word segmenting system for cutting out words from image information read from a document in a Western language or the like.

[Prior Art]

In document processing apparatus that handles Western-language text, processing is often required not only character by character but also word by word.

For example, an OCR system that processes Western-language text may not only recognize individual characters but also perform word-level processing, such as correcting or preventing recognition errors by comparing each word (a group of recognized characters) against a word dictionary, or inferring characters that could not be recognized. In this case it is necessary not only to cut out text lines, and then characters, from the read image information of the document, but also to cut out words.

The same word segmentation is indispensable in computer translation systems that translate Western-language text into Japanese.

Conventionally, such word segmentation takes the vertical projection of the image information of a text line cut out from the read image, compares the width of each break in the projection with a fixed inter-word gap decision threshold set in advance, regards breaks whose width is at least the threshold as word boundaries, and cuts out the words from the text line image information.

However, Western-language magazines and the like are generally printed with proportional pitch, so the inter-character and inter-word gap widths are not constant. With a fixed threshold, inter-word gaps are easily misjudged, and as a result word segmentation errors are likely.

Moreover, even for documents that are not printed with proportional pitch, a method using a fixed decision threshold may frequently produce word segmentation errors. One example is a Western-language document in which characters of different sizes are mixed.

That is, an inter-character gap printed in a large character size can be wider than an inter-word gap printed in a small character size, so a fixed decision threshold as used conventionally cannot judge inter-word gaps correctly, and the probability of word segmentation errors is high.
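The failure mode described above can be illustrated with a small numeric sketch in Python. The gap widths, the threshold value, and the name `word_gap_indices` are hypothetical and are not taken from the specification; they only show how a single fixed threshold can work for small type and fail for large type.

```python
# Hypothetical gap widths in pixels, listed left to right along a text line.
small_line = [1, 2, 1, 5, 1, 2, 5, 1]    # small type: the word gaps are 5 px
large_line = [4, 6, 5, 12, 4, 6, 12, 5]  # large type: the word gaps are 12 px

FIXED_THRESHOLD = 5  # assumed value, tuned for the small type only


def word_gap_indices(gaps, threshold):
    """Return the positions of gaps whose width is at least the threshold."""
    return [i for i, width in enumerate(gaps) if width >= threshold]


small_breaks = word_gap_indices(small_line, FIXED_THRESHOLD)
large_breaks = word_gap_indices(large_line, FIXED_THRESHOLD)

print(small_breaks)  # only the two real word gaps are found
print(large_breaks)  # several inter-character gaps are also reported
```

In the large-type line the inter-character gaps of 5 and 6 pixels reach the fixed threshold, so the line would be split in the middle of words.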

[Object]

An object of the present invention is to provide a word segmenting system that reliably cuts out words from image information read from Western-language documents printed with proportional pitch, Western-language documents in which characters of different sizes are mixed, and the like.

[Constitution]

To achieve this object, the word segmenting system of the present invention takes the vertical projection of the image information of a text line, creates a histogram of the widths of the breaks in the projection, determines the inter-word gap decision threshold according to the largest width corresponding to a peak of the histogram, judges breaks in the projection whose width is at least the threshold to be gaps between words, and cuts out words from the image information of the text line.

[Embodiment]

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a schematic block diagram showing the configuration of the main parts of a document processing apparatus to which the word segmenting system of the present invention is applied. In the figure, 10 is a scanner for reading a document (for example, an English document), and the image information of the document read by the scanner 10 is stored in an image buffer 12. The image information is input sequentially from the image buffer 12 to a line segmenting unit 14, which cuts out the image information of each text line.

This line segmentation is performed, for example, by the common projection method, in which the projection in the horizontal direction (the line direction) is obtained and the span between one valley of the projection and the next is cut out. Of course, the line segmentation may instead be performed by an improved projection method that takes a projection for each partial region. The image information of each cut-out text line is temporarily stored in a line buffer 16 and then input to a word segmenting unit 18 in the subsequent stage.
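As a rough illustration only, the common projection method for line segmentation might be sketched in Python as follows. The representation of the page as a list of rows of 0/1 pixels (1 meaning black), the treatment of a "valley" as a row with no black pixels, and the names `horizontal_projection` and `cut_text_lines` are all assumptions made for this sketch; the specification does not fix these details.

```python
def horizontal_projection(image):
    """Count the black pixels (1s) in each row of a binary image."""
    return [sum(row) for row in image]


def cut_text_lines(image):
    """Return (top, bottom) row ranges, bottom exclusive, of each text line.

    Assumption: a valley of the projection is a row with no black pixels.
    """
    lines = []
    start = None
    for y, count in enumerate(horizontal_projection(image)):
        if count > 0 and start is None:
            start = y                    # first inked row of a new text line
        elif count == 0 and start is not None:
            lines.append((start, y))     # an empty row closes the text line
            start = None
    if start is not None:
        lines.append((start, len(image)))
    return lines


# Tiny hypothetical page: two text lines separated by an empty row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
line_ranges = cut_text_lines(page)
print(line_ranges)
```

Each returned range could then be used to slice the rows of one text line out of the page image before word segmentation.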

The word segmenting unit 18 consists of a vertical projection extracting circuit 20, a gap histogram generating circuit 22, an inter-word gap decision threshold determining circuit 24, and a word segmenting circuit 26.

The vertical projection extracting circuit 20 obtains the projection of the text line image information input from the line buffer 16 in the vertical direction (the direction perpendicular to the text line). A continuous portion of this projection corresponds to the extent of characters, and a break in the projection (a valley) is either a gap between characters or a gap between words.
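The vertical projection and its breaks can be sketched in Python as below. This is a minimal illustration, assuming the text line image is a list of rows of 0/1 pixels (1 meaning black); the names `vertical_projection` and `find_breaks`, and the decision to ignore the left and right margins, are assumptions of this sketch and are not stated in the specification.

```python
def vertical_projection(line_image):
    """Count the black pixels (1s) in each column of a binary text line image."""
    return [sum(column) for column in zip(*line_image)]


def find_breaks(projection):
    """Return (start, width) for each run of empty columns between inked columns.

    Empty columns before the first ink or after the last ink are margins,
    not gaps, so they are not reported.
    """
    breaks = []
    start = None
    seen_ink = False
    for x, count in enumerate(projection):
        if count == 0:
            if seen_ink and start is None:
                start = x                        # a break opens after some ink
        else:
            if start is not None:
                breaks.append((start, x - start))  # ink closes the open break
                start = None
            seen_ink = True
    return breaks


# Hypothetical two-row text line image.
line_image = [
    [0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
]
projection = vertical_projection(line_image)
breaks = find_breaks(projection)
print(projection)
print(breaks)
```

The widths in the returned pairs are the values that would be counted in the gap histogram.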

As described above, in Western-language documents printed with proportional pitch and the like, inter-character gaps and inter-word gaps vary widely, so the decision threshold for identifying inter-word gaps must be set appropriately for each text line. The circuits for this purpose are the gap histogram generating circuit 22 and the inter-word gap decision threshold determining circuit 24.

The gap histogram generating circuit 22 receives the vertical projection information of one text line from the vertical projection extracting circuit 20 and creates a gap histogram by counting the frequency of each width of break (gap) in the vertical projection. Suppose, for example, that image information of a Western-language document such as that shown in FIG. 2 is input. In this figure, L1, L2, and L3 each denote a text line, each hatched range denotes a word, and the blank spaces between words denote inter-word gaps.

When the image information of text line L2 is input to the word segmenting unit 18, a gap histogram such as that shown in FIG. 3 is obtained. In this gap histogram, range A corresponds to inter-character gaps and range B corresponds to inter-word gaps.
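A gap histogram of this kind can be sketched in Python with `collections.Counter`. The gap widths below are hypothetical values chosen so that a cluster of narrow gaps (like range A) and a cluster of wide gaps (like range B) both appear; they are not data from FIG. 3, and the name `gap_histogram` is an assumption.

```python
from collections import Counter


def gap_histogram(gap_widths):
    """Count how often each gap width occurs in one text line."""
    return Counter(gap_widths)


# Hypothetical gap widths (in pixels) measured along one text line.
gap_widths = [2, 3, 2, 9, 2, 3, 10, 3, 2, 9, 1, 2]
histogram = gap_histogram(gap_widths)

# Text rendering of the histogram: one row per width, one mark per occurrence.
for width in range(1, max(histogram) + 1):
    print(f"{width:2d} {'#' * histogram[width]}")
```

In this made-up line the narrow widths 1 to 3 play the role of inter-character gaps and the widths 9 and 10 play the role of inter-word gaps.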

The inter-word gap decision threshold determining circuit 24 receives this gap histogram information and detects the largest width (Gm) corresponding to a peak of the gap histogram. Only peaks whose frequency is at least a predetermined value (for example, 2) are considered.

In the gap histogram shown in FIG. 3, for example, the gap width corresponding to the peak in range B is detected as Gm.

The inter-word gap decision threshold determining circuit 24 then calculates the inter-word gap decision threshold Gt by a formula (not reproduced in this text) in which Gδ is a constant satisfying 0 < Gδ < Gm, and sends the information of the inter-word gap decision threshold Gt to the word segmenting circuit 26.
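The detection of Gm and the derivation of Gt might be sketched in Python as follows. Two points are assumptions of this sketch rather than statements of the specification. First, a "peak" is taken to be a width whose frequency is at least the predetermined value and not lower than that of either neighboring width. Second, because the formula for Gt does not survive in this text, the form Gt = Gm - Gδ is assumed here as one reading consistent with the stated constraint 0 < Gδ < Gm. The function names and the value of `g_delta` are hypothetical.

```python
def largest_peak_width(histogram, min_frequency=2):
    """Return Gm, the largest gap width at which the histogram has a peak.

    Assumption: a peak is a width whose frequency is at least min_frequency
    and not lower than the frequency of either neighboring width.
    Returns None if no width qualifies.
    """
    peaks = [
        width
        for width, count in histogram.items()
        if count >= min_frequency
        and count >= histogram.get(width - 1, 0)
        and count >= histogram.get(width + 1, 0)
    ]
    return max(peaks) if peaks else None


def decision_threshold(gm, g_delta):
    """Assumed form Gt = Gm - Gdelta, with the constant 0 < Gdelta < Gm."""
    if not 0 < g_delta < gm:
        raise ValueError("g_delta must satisfy 0 < g_delta < gm")
    return gm - g_delta


# Hypothetical gap histogram: width in pixels -> frequency.
histogram = {1: 1, 2: 5, 3: 3, 9: 2, 10: 1}
gm = largest_peak_width(histogram)
gt = decision_threshold(gm, g_delta=3)
print(gm, gt)
```

With these made-up values the peaks lie at widths 2 and 9, so the larger one is taken as Gm, and the resulting threshold falls between the narrow and the wide cluster of gaps.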

The word segmenting circuit 26 identifies, from the vertical projection information input from the vertical projection extracting circuit 20, breaks in the vertical projection whose width is at least the inter-word gap decision threshold as inter-word gaps, and cuts out from the text line image information the image information of each word delimited by the inter-word gaps.
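The word cutting step can be sketched in Python as below. This is an illustration under assumptions: the projection is a list of per-column black pixel counts, the output is a list of column ranges, and the name `cut_words` and the sample values are hypothetical.

```python
def cut_words(projection, threshold):
    """Return (start, end) column ranges, end exclusive, of each word.

    A run of empty columns at least `threshold` wide ends the current word;
    narrower runs are treated as inter-character gaps and stay inside it.
    """
    words = []
    word_start = None  # first inked column of the current word
    last_ink = None    # most recent inked column
    for x, count in enumerate(projection):
        if count == 0:
            continue
        if word_start is None:
            word_start = x
        elif x - last_ink - 1 >= threshold:
            words.append((word_start, last_ink + 1))  # wide break: close word
            word_start = x
        last_ink = x
    if word_start is not None:
        words.append((word_start, last_ink + 1))
    return words


# Hypothetical projection: breaks of width 1, 4, and 1 between inked runs.
projection = [0, 3, 2, 0, 4, 0, 0, 0, 0, 2, 5, 0, 1, 0]
words = cut_words(projection, threshold=3)
print(words)
```

Only the break of width 4 reaches the threshold, so the line is divided into two words, and each range can be used to slice the columns of one word out of the text line image.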

Suppose, for example, that image information of two text lines whose vertical projections are as shown in FIG. 4 is input sequentially to the word segmenting unit 18. In the figure, each hatched range is a continuous range of the vertical projection and corresponds to a string of one or more characters, or to a word. With proportional pitch printing, the gap width in the vertical projection of a text line varies as in this example.

In such text lines, the breaks in the vertical projection marked with an arrow (↑) are judged to be inter-word gaps, and the words are cut out.

Because the inter-word gap decision threshold is thus determined dynamically from the gap histogram of the vertical projection of each text line, inter-word gaps can be identified reliably and words cut out accurately even for documents printed with proportional pitch and documents in which characters of different sizes are mixed.

The function of the word segmenting unit 18 may also be realized by software processing using a microprocessor or the like. The present invention can be carried out with various other modifications as well.

[Effect]

As is clear from the above detailed description, the present invention determines the inter-word gap decision threshold dynamically from the gap histogram of the vertical projection of a text line and performs word segmentation accordingly, so that words can be cut out accurately even from documents printed with proportional pitch and documents in which characters of different sizes are mixed.

[Brief Description of the Drawings]

FIG. 1 is a schematic block diagram showing only the main configuration of a document processing apparatus to which the word segmenting system of the present invention is applied; FIG. 2 shows an example of text lines used to explain the gap histogram; FIG. 3 shows an example of a gap histogram; and FIG. 4 shows the vertical projections of text lines together with their inter-word gaps.

10: scanner
12: image buffer
14: line segmenting unit
16: line buffer
18: word segmenting unit
20: vertical projection extracting circuit
22: gap histogram generating circuit
24: inter-word gap decision threshold determining circuit
26: word segmenting circuit

Agent: Patent Attorney Makoto Suzuki

[Figures not reproduced: FIG. 2, text lines L1 to L3; FIG. 3, gap histogram (axis: frequency); FIG. 4, vertical projections with arrows marking inter-word gaps]

Claims (1)

[Claims]

(1) A word segmenting system for a document processing apparatus that receives image information of a document and performs processing including segmentation of the words of the document, characterized in that a vertical projection of the image information of a text line is taken, a histogram of the widths of the breaks in the projection is created, an inter-word gap decision threshold is determined according to the largest width corresponding to a peak of the histogram, breaks in the projection whose width is at least the threshold are judged to be gaps between words, and words are cut out from the image information of the text line.
JP60274051A 1985-12-05 1985-12-05 Word segmenting system Pending JPS62133585A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP60274051A JPS62133585A (en) 1985-12-05 1985-12-05 Word segmenting system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP60274051A JPS62133585A (en) 1985-12-05 1985-12-05 Word segmenting system

Publications (1)

Publication Number Publication Date
JPS62133585A (en) 1987-06-16

Family

ID=17536281

Family Applications (1)

Application Number Title Priority Date Filing Date
JP60274051A Pending JPS62133585A (en) 1985-12-05 1985-12-05 Word segmenting system

Country Status (1)

Country Link
JP (1) JPS62133585A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS63103390A (en) * 1986-10-20 1988-05-09 Sharp Corp Word processing system
JPS63158678A (en) * 1986-12-23 1988-07-01 Sharp Corp Inter-word space detecting method
JPH02255995A (en) * 1988-04-28 1990-10-16 Seiko Epson Corp Character segmenting method
JPH03225576A (en) * 1990-01-31 1991-10-04 Oki Electric Ind Co Ltd Device for segmenting word
US5357581A (en) * 1991-11-01 1994-10-18 Eastman Kodak Company Method and apparatus for the selective filtering of dot-matrix printed characters so as to improve optical character recognition
US5394482A (en) * 1991-11-01 1995-02-28 Eastman Kodak Company Method and apparatus for the detection of dot-matrix printed text so as to improve optical character recognition
JPH07319998A (en) * 1988-04-28 1995-12-08 Seiko Epson Corp Character cutting method

Similar Documents

Publication Publication Date Title
US5664027A (en) Methods and apparatus for inferring orientation of lines of text
EP0544430B1 (en) Method and apparatus for determining the frequency of words in a document without document image decoding
EP0544434A2 (en) Method and apparatus for processing a document image
JP5508359B2 (en) Character recognition device, character recognition method and program
US5561720A (en) Method for extracting individual characters from raster images of a read-in handwritten or typed character sequence having a free pitch
JPS62133585A (en) Word segmenting system
US4887301A (en) Proportional spaced text recognition apparatus and method
CN113553852A (en) Contract information extraction method, system and storage medium based on neural network
JP2000089786A (en) Method and apparatus for correcting speech recognition result
JP3537570B2 (en) Space detection method for Japanese-English mixed documents, pitch format determination method, and space detection method for fixed-pitch alphanumeric character strings
JP2915175B2 (en) Word space detection method
JPS6226587A (en) Character field free pitch processing system for optical character reader
JP2968354B2 (en) Post-processing method of character recognition result
JPH02230484A (en) character recognition device
JP2985813B2 (en) Character string recognition device and knowledge database learning method
JPH04335487A (en) Character segmenting method for character recognizing device
JP2746345B2 (en) Post-processing method for character recognition
JP2887823B2 (en) Document recognition device
JP3086264B2 (en) Character space recognition method
JP2891368B2 (en) Post-processing method of character recognition result
JPH10171924A (en) Character recognition device
JP2851102B2 (en) Character extraction method
JPH05225183A (en) Automatic error detector for words in japanese sentence
JPH01277989A (en) Character string pattern reader
JP3345469B2 (en) Word spacing calculation method, word spacing calculation device, character reading method, character reading device