JP4995801B2 - Document analysis apparatus, document analysis program, and document analysis method

Info

Publication number: JP4995801B2
Application number: JP2008280765A
Authority: JP
Inventors: 貴志澁谷; 裕美子吉村; 正樹新藤; 遠航蔡
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2008-10-31
Filing date: 2008-10-31
Publication date: 2012-08-08
Anticipated expiration: 2028-10-31
Also published as: JP2010108326A

Description

The present invention relates to a document analysis apparatus, a document analysis program, and a document analysis method that reshape the citation hierarchy and perform sentence recognition within the same hierarchy even when a mail tool has inserted line breaks automatically.

In recent years, personal computers have become widespread, and as the Internet environment has matured, the exchange of information with other countries by e-mail has become increasingly common. The number of users who rely on machine translation to translate e-mail is also growing.

E-mail users used to receive training at schools and companies on the etiquette of writing e-mail, and they wrote with readability in mind, for example by inserting a line break after a fixed number of characters. With the spread of personal computers, however, more users are unaware of such etiquette. As a result, many mails are now written with text running across the full width of the screen or with no line breaks at all.

For this reason, some mail software has a function that automatically wraps (breaks) lines at a specified number of columns. When this function is enabled, the combination of explicit line breaks and breaks produced by automatic wrapping makes the text very hard to read. Moreover, when a mail is replied to, quotation symbols are inserted into the quoted portion of the original mail, which causes further line breaks. As replies are repeated, sentences that originally belonged to the same hierarchy end up at different levels, and it becomes difficult to tell whether a passage is a quotation or a reply.

To prevent the body of mail received from a personal computer or the like from being displayed with line breaks at unnatural positions, there is a technique that reshapes the line break data in the received mail body to eliminate the unnatural breaks (see, for example, Patent Document 1). Specifically, Patent Document 1 stores the first quotation symbol of a quoted passage in the mail text and, on detecting line break data, determines whether the break was inserted automatically by the mail software or intentionally by the user. If the break is judged to be automatic, it is deleted and the following text is treated as part of the quoted passage. If the break is judged to be intentional, a quotation terminator is appended to the end of the quoted passage. Line breaks are then inserted according to the display width of the screen, and a quotation symbol is inserted at the head of each line of text lying between the quotation symbol and the quotation terminator, thereby improving the readability of the mail document.

Patent Document 1: JP-A-11-184775

In Patent Document 1, however, the data following a quotation symbol detected at the head of a line is inspected, and any second or subsequent quotation symbol is deleted. When line break data is detected, the break is judged to have been inserted automatically and is deleted if the character following the line break data, that is, the first character of the next line, is none of the following: a space, "(" (opening parenthesis), "?" (question mark), a digit, or "・" (middle dot). The break is likewise judged to be automatic and deleted if the character immediately before the line break data is neither a punctuation mark nor ")" (closing parenthesis).

If, on the other hand, the break is judged not to have been inserted automatically, a code marking the end of the quotation is appended, and when the text is displayed, line breaks are inserted at positions that match the current screen width. For text belonging to a quoted portion, a quotation symbol is added after each line break code, which improves the readability of the mail document. Consequently, there is a problem that data is recognized as belonging to the same hierarchy even when the citation hierarchies differ. In addition, with mail body data A1 such as that shown below, the automatically inserted line break before "2" is not deleted because the following line begins with a digit.

[Mail body data A1]

An object of the present invention is to provide a document analysis program and a document analysis method that can reshape mail body data so that sentence units are recognized more accurately, by determining the regularity with which automatic line break data was inserted, even when mail has been exchanged multiple times and the citation hierarchy has become disordered.

The document analysis apparatus according to the present invention comprises a document hierarchy determination unit, a same hierarchy determination unit, and a document shaping unit. The document hierarchy determination unit reads mail body data input from an input device line by line, determines a data length for each line based on the number of characters in that line's character string, and determines a citation hierarchy for each line based on the number of quotation symbols in that line's character string. The same hierarchy determination unit operates on a data-length near-maximum line, that is, a line whose data length, as determined by the document hierarchy determination unit, is within a predetermined range of the maximum data length. When the citation hierarchy of the data-length near-maximum line is greater than that of the next line, the same hierarchy determination unit determines whether the length of a combined character string is within the predetermined range of the maximum data length, where the combined character string is formed from the character string of the data-length near-maximum line with its quotation symbols and line break removed and the character string of the next line with its quotation symbols and line break removed. When the length of the combined character string is within that range, the same hierarchy determination unit determines that the data-length near-maximum line and the next line are actually character strings of the same hierarchy. The document shaping unit combines the character strings of the data-length near-maximum line and the next line that were determined to be of the same hierarchy into a combined character string, adds to the combined character string the quotation symbols representing the citation hierarchy of the data-length near-maximum line, and outputs the result to an output device as a new line.

According to the present invention, mail body data can be reshaped so that sentence units are recognized more accurately, by determining the regularity with which automatic line break data was inserted, even when mail has been exchanged multiple times and the citation hierarchy has become disordered.

FIG. 1 is a functional block diagram of a document analysis apparatus 11 according to an embodiment of the present invention, and FIG. 2 is a block diagram showing the hardware configuration of the document analysis apparatus according to the embodiment.

In FIG. 2, the document analysis apparatus 11 is realized, for example, by installing a software program such as a document analysis program on a general-purpose computer and executing that program on the processor 13 of the arithmetic control device 12.

The arithmetic control device 12 performs the various operations related to document analysis and includes a processor 13 and a memory 14. The memory 14 stores a document analysis program 15 related to translation, and a work area 16 is used when the processor 13 executes processing. The results computed by the arithmetic control device 12 are displayed on a display device 18, which serves as an output device 17, and are also output to a communication network via a communication control device 19.

The input device 20 inputs information to the arithmetic control device 12 and consists of, for example, a mouse 21, a keyboard 22, a disk drive 23, and the communication control device 19. The mouse 21 and the keyboard 22 are used, for example, to input various commands to the arithmetic control device 12 via the display device 18, while the keyboard 22, the disk drive 23, and the communication control device 19 are used to input mail body data.

That is, the disk drive 23 reads mail body data from and writes it to a storage medium, and the communication control device 19 connects the document analysis apparatus 11 to a communication network such as the Internet or a LAN. The communication control device 19 is a device such as a LAN card or a modem, and data exchanged with the communication network through it is passed to and from the arithmetic control device 12 as input or output signals. A hard disk drive (HDD) 24 is also provided to store the results computed by the arithmetic control device 12, the mail tool program, and the like.

FIG. 1 is a functional block diagram of the document analysis apparatus 11 according to the embodiment of the present invention. Each functional block in the arithmetic control device 12 shown in FIG. 2 corresponds to one of the programs that make up the document analysis program 15 described above. That is, when the processor 13 executes the programs that make up the document analysis program 15, the arithmetic control device 12 functions as each of the functional blocks. The storage device 25 block corresponds to the storage areas of the memory 14 and the hard disk drive 24 in the arithmetic control device 12.

The input processing unit 26 serves as the input interface to the outside and receives mail body data and commands through the input device 20, such as the communication control device 19 connected to the Internet or the like and the keyboard 22.

The output processing unit 27 serves as the output interface to the outside and outputs mail body data through the output device 17, such as the communication control device 19 connected to the Internet or the like and the display device 18.

The control unit 28 controls the apparatus as a whole. It stores the mail body data sent from the input processing unit 26 in the data temporary storage unit 29 of the storage device 25, controls the document hierarchy determination unit 30, the same hierarchy determination unit 31, the document shaping unit 32, and the sentence recognition processing unit 33, and sends the contents of the data temporary storage unit 29, which holds their results, to the output processing unit 27 for output to the output device 17.

The document hierarchy determination unit 30 determines the hierarchy of the mail body data obtained through the input processing unit 26. That is, it determines a data length for each line based on the number of characters in that line's character string, and determines a citation hierarchy for each line based on the number of quotation symbols in that line's character string. It then stores the character string, data length, and citation hierarchy of each line in the data temporary storage unit 29 of the storage device 25.

The same hierarchy determination unit 31 determines whether adjacent lines belong to the same citation hierarchy. The method of determination is described later.

The document shaping unit 32 combines the character strings of lines that the same hierarchy determination unit 31 has determined to be of the same hierarchy into a combined character string, adds the quotation symbols representing that citation hierarchy to the combined character string, and stores the result as a new line in the data temporary storage unit 29 of the storage device 25.

The sentence recognition processing unit 33 combines adjacent lines of the same hierarchy in the mail body data, including the lines newly added by the document shaping unit 32, treats each full stop in the combined character string as the end of a sentence, creates for each sentence a character string with the quotation symbols of its citation hierarchy prefixed to it, and stores these in the data temporary storage unit 29 of the storage device 25.
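The behavior described for the sentence recognition processing unit 33 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `split_prefix` and `recognize_sentences`, the assumption that the quotation symbols are `>` and `|` (each optionally followed by one space), the choice to group adjacent lines by identical prefix text, and the sample lines used in any test are all hypothetical.

```python
import re
from itertools import groupby

# Assumed quotation symbols ">" and "|", each optionally followed by one space.
_PREFIX = re.compile(r"(?:[>|] ?)*")


def split_prefix(line: str) -> tuple[str, str]:
    """Split a line into (quotation prefix, remaining text)."""
    end = _PREFIX.match(line).end()
    return line[:end], line[end:]


def recognize_sentences(lines: list[str], full_stop: str = "。") -> list[str]:
    """Join adjacent lines that share a prefix, then cut at each full stop."""
    result = []
    for prefix, group in groupby(lines, key=lambda ln: split_prefix(ln)[0]):
        text = "".join(split_prefix(ln)[1] for ln in group)
        parts = text.split(full_stop)
        sentences = [p + full_stop for p in parts[:-1]]
        if parts[-1]:  # trailing text with no closing full stop
            sentences.append(parts[-1])
        # Each sentence gets the quotation prefix of its hierarchy.
        result.extend(prefix + s for s in sentences)
    return result
```

Grouping by the literal prefix rather than by a numeric level is a simplification chosen for the sketch. It keeps `>|` and `>>` apart even though both contain two symbols.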

Next, the operation of Example 1 of the document analysis apparatus 11 according to the embodiment of the present invention is described. FIG. 3 is a flowchart showing the operation of Example 1 of the document analysis apparatus 11.

When mail body data is input from the input device 20 to the arithmetic control device 12, the control unit 28 activates the input processing unit 26, which reads the mail body data one line break at a time (S201). Assume that mail body data A2 such as the following has been read.

[Mail body data A2]

Mail body data A2 is what results when mail body data A3, shown below, is automatically wrapped at 20 full-width characters by the mail tool setting, "はい。" ("Yes.") is inserted in the first reply, and the mail is then forwarded twice.

[Mail body data A3]

As a comparison of mail body data A2 and A3 shows, in mail body data A2 characters that were originally on the same line have been split onto different lines by automatic wrapping, and those lines have also ended up with different citation hierarchies.

The mail body data A2 read by the input processing unit 26 is stored in the data temporary storage unit 29 of the storage device 25. Table 1 shows an example of the table of mail body data A2 stored in the data temporary storage unit 29.

As shown in Table 1, the array number is the number assigned to the data of each line of the mail body data, and the data is the character string of that line. The length is the length of the line's data (its number of characters), and the citation hierarchy is the level indicating how many times the mail has been replied to. In Table 1 the mail body data A2 is stored exactly as read, so the length and citation hierarchy columns are blank.
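The table layout described above can be pictured with a small record type. This is only an illustrative sketch: the names `LineRecord` and `load_body` are hypothetical, and the use of `None` for the blank length and citation hierarchy columns is an assumption about how "blank" might be represented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineRecord:
    index: int                    # array number, 1-based as in the text
    data: str                     # the line's character string
    length: Optional[int] = None  # filled in later by hierarchy determination
    level: Optional[int] = None   # citation hierarchy, also filled in later


def load_body(body: str) -> list[LineRecord]:
    """Split the mail body at line breaks and store each line as read (S201)."""
    return [LineRecord(i, text) for i, text in enumerate(body.splitlines(), start=1)]
```

At this stage only `index` and `data` carry values, which mirrors the state of the table before document hierarchy determination runs.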

Once the mail body data A2 has been read line by line and stored in the data temporary storage unit 29 of the storage device 25 as shown in Table 1, the control unit 28 activates the document hierarchy determination unit 30, and document hierarchy determination is performed (S202).

FIG. 4 is a flowchart showing the document hierarchy determination processing of the document hierarchy determination unit 30. This processing determines the hierarchy of the mail body data.

First, the document hierarchy determination unit 30 sets the line data designation variable i to the initial value "1" (S301) and acquires the data Si whose array number is i from the mail body data A2 stored in the data temporary storage unit 29 of the storage device 25 (S302). When i is "1", the string ">>> メールツールの設定が、「メール送信時" is acquired as the data S1.

Next, the initial value "1" is set in the variable L indicating the citation hierarchy, the variable N designating each element of the data Si, and the variable M indicating the position in the quotation symbol array (S303). The variable L indicates how many times the mail has been replied to, the variable N designates the position of an element (character) within the data Si, and the variable M selects a quotation symbol. Table 2 shows an example of the quotation symbol array.

When the variable M indicating the position in the quotation symbol array is "1", ">" is acquired as the quotation symbol, and when M is "2", "|" is acquired as the quotation symbol.

Next, the quotation symbol CM at the position indicated by the variable M is acquired (S304). It is then determined whether the character of the data Si starting at position N matches the quotation symbol CM (S305). If it does not match, it is determined whether the quotation symbol CM is the last entry in the array (S306). If it is not the last entry, M is set to M + 1 (S307) and the process returns to step S304, where "|" is now acquired as the quotation symbol CM. In this way, starting from the head of the data Si, each character of Si is checked against every quotation symbol stored in the quotation symbol array shown in Table 2.

Assume now that i, L, N, and M are all "1". The data S1 of the first line, where i is "1", is ">>> メールツールの設定が、「メール送信時". In step S304, since the variable M is 1, the quotation symbol CM is ">", and the character of the data Si (S1) at the position indicated by the variable N (= 1) is ">". In the determination in step S305, therefore, the character ">" of the data Si (S1) starting at N (= 1) matches the quotation symbol CM (= ">"), and the process proceeds to step S308.

In step S308, the length of the matched quotation symbol CM (">") is added to the variable N indicating the determination start position in the data Si, which advances N to the next determination start position. The variable L indicating the citation hierarchy is also incremented to L + 1, moving it to the next citation hierarchy. The variable M indicating the position in the quotation symbol array is left at M = 1, because M is updated only in step S307. If the N-th position of the data Si is a space, N is incremented to N + 1 to advance it to the next determination start position.

The process then returns to step S304, and steps S304, S305, and S308 are repeated until, in step S305, the character at position N of the data Si no longer matches the quotation symbol CM. When it no longer matches, the process moves to step S306, and steps S304, S305, and S308 are repeated until the quotation symbol CM is the last entry in the array.

When step S306 determines that the quotation symbol CM is the last entry, the process proceeds to step S309. Because some mail tools insert a space together with the quotation symbol CM, step S309 evaluates spaces. It determines whether N ≠ 1, the N-th character of the data Si is a space, and the (N−1)-th character is not a space. If all of these hold, the space is treated as a quotation symbol and the process proceeds to S308. If the condition of step S309 is not satisfied, the data Si and its length are stored in the data temporary storage unit 29 together with the variable L indicating the citation hierarchy (S310). As a result, as shown in Table 3, a length of 40 and a citation hierarchy of 4 are stored for the data S1 of the first line of the mail body data. The length of the data S1 is expressed as a number of half-width characters, and the citation hierarchy is the total of the number of quotation symbols CM and spaces.
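The prefix scan of steps S303 to S310 can be sketched roughly as below. This is a hedged illustration, not the flowchart itself: the names `QUOTE_SYMBOLS`, `display_width`, and `scan_quote_prefix` are hypothetical, the quotation symbol array is assumed to hold only `>` and `|`, and the counter starts at 0 (rather than the initial value "1" given for L) so that the example line yields the citation hierarchy of 4 stated in the text. Full-width characters are counted as two half-width units using Python's `unicodedata.east_asian_width`.

```python
import unicodedata

# Assumed contents of the quotation symbol array (Table 2 is not reproduced here).
QUOTE_SYMBOLS = (">", "|")


def display_width(text: str) -> int:
    """Length in half-width units: wide and full-width characters count as 2."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in text
    )


def scan_quote_prefix(line: str) -> tuple[int, int]:
    """Return (citation level, index where the quoted text starts)."""
    level = 0  # assumption: start at 0 so ">>> " gives 4, as in the example
    pos = 0
    while pos < len(line):
        matched = next((s for s in QUOTE_SYMBOLS if line.startswith(s, pos)), None)
        if matched is not None:
            pos += len(matched)  # advance N by the symbol length (S308)
            level += 1
        elif pos > 0 and line[pos] == " " and line[pos - 1] != " ":
            # S309: one space directly after a symbol counts as part of the prefix
            pos += 1
            level += 1
        else:
            break
    return level, pos
```

For the first line of the example, the scan consumes three `>` symbols and one space, and the line's width in half-width units is 4 for the prefix plus 36 for the eighteen full-width characters.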

Next, it is determined whether the data Si is the last line (S311). If it is not, 1 is added to the line data designation variable i (S312) and the process returns to step S302. The same processing is thus applied to the second and subsequent lines, and the processing ends when the last line is reached. At that point, as shown in Table 4, the table of mail body data A2 stored in the data temporary storage unit 29 of the storage device 25 holds, for the data Si of each line, its length and citation hierarchy.

Once the document hierarchy determination unit 30 has determined the hierarchy of the mail body data in this way, the same hierarchy determination unit 31 determines whether adjacent lines belong to the same citation hierarchy. In step S203 of FIG. 3, the same hierarchy determination unit 31 obtains, from the table of mail body data A2 shown in Table 4 and stored in the data temporary storage unit 29 of the storage device 25, the maximum value n of the citation hierarchy and the maximum value MAX of the data length among the data Si, and stores them in the data temporary storage unit 29. It also sets the line data designation variable i to "1" and the loop variable J to "1" (S203).

Next, the same hierarchy determination unit 31 acquires the i-th data Si and its citation hierarchy Li from the mail body data A2 shown in Table 4 (S204). It then determines whether the acquired data Si is a data-length near-maximum line, that is, whether its length falls within the approximate range of the maximum data length MAX. Specifically, it determines whether the length of the data Si > MAX − correction value (S205). Here, lengths within ±2 of the maximum data length MAX are treated as approximately equal to it. The correction value may be made user-configurable. A correction value is used to allow for effects such as a space being inserted as part of the quotation prefix or, in English text, the spaces between words.
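The comparison in step S205 amounts to a single inequality. The sketch below uses the hypothetical name `is_near_max` and takes the correction value of 2 from the text as a default parameter. Whether the boundary case is included follows the strict ">" stated for S205 and is otherwise an assumption.

```python
def is_near_max(length: int, max_len: int, correction: int = 2) -> bool:
    """S205: a line is a data-length near-maximum line when length > MAX - correction."""
    return length > max_len - correction
```

With MAX = 40 and the default correction, lines of length 39 or 40 qualify, while a line of length 38 does not.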

When step S205 determines that the data Si is a data-length near-maximum line, the same hierarchy determination unit 31 determines whether the data Si+1 of the next line, line i+1, can be acquired (S206). If it can, the unit acquires the data Si+1 of line i+1 and its citation hierarchy Li+1 (S207).

The same hierarchy determination unit 31 then determines whether the citation hierarchy Li of the data Si, the data-length near-maximum line, is greater than the citation hierarchy Li+1 of the data Si+1, and whether the sum of the lengths of the data Si and the data Si+1 with their quotation symbol portions removed is approximately equal to the maximum data length MAX. That is, it determines whether Li > Li+1 and the combined length of Si and Si+1 without quotation symbols is within the approximate range of MAX (S208). When the condition of step S208 is satisfied, the same hierarchy determination unit 31 determines that the data-length near-maximum line and the next line are actually character strings of the same hierarchy. This is because the data Si+1 can be judged to have originally been part of the data Si and to have been separated from it by an automatically inserted line break.
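A minimal sketch of the step S208 test follows. Several points are assumptions rather than statements from the text: the names `width`, `split_prefix`, and `is_same_hierarchy`, the quotation symbols `>` and `|`, taking the citation hierarchy as the number of characters in the prefix (symbols plus spaces), and reading "within the approximate range of MAX" as an absolute difference of at most the correction value. The sample lines in the test are hypothetical.

```python
import re
import unicodedata

# Assumed quotation symbols ">" and "|", each optionally followed by one space.
_PREFIX = re.compile(r"(?:[>|] ?)*")


def width(text: str) -> int:
    """Length in half-width units (wide and full-width characters count as 2)."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in text
    )


def split_prefix(line: str) -> tuple[str, str]:
    """Split a line into (quotation prefix, remaining text)."""
    end = _PREFIX.match(line).end()
    return line[:end], line[end:]


def is_same_hierarchy(cur: str, nxt: str, max_len: int, correction: int = 2) -> bool:
    """S208: Li > Li+1 and the combined unquoted length is close to MAX."""
    cur_prefix, cur_body = split_prefix(cur)
    nxt_prefix, nxt_body = split_prefix(nxt)
    if len(cur_prefix) <= len(nxt_prefix):  # requires Li > Li+1
        return False
    combined = width(cur_body) + width(nxt_body)
    return abs(combined - max_len) <= correction  # assumed reading of "approximate"
```

The intuition is that a wrapped remainder carries one fewer quotation symbol than the line it was cut from, and that gluing the two unquoted pieces back together should restore roughly the wrap width.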

Next, the document shaping unit 32 removes the quotation symbols from Si+1 and joins it to Si (S209). That is, it removes the quotation symbols at the head of the data Si+1 and appends the remainder to the end of the data Si. The quotation symbols removed here include the first space that follows a quotation symbol.
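Step S209 can be illustrated with a few lines of code. The name `merge_wrapped` and the assumed quotation symbols `>` and `|` are hypothetical, and the sample strings in the test are not taken from the patent.

```python
import re

# Assumed quotation symbols ">" and "|", each optionally followed by one space.
_PREFIX = re.compile(r"(?:[>|] ?)*")


def merge_wrapped(cur: str, nxt: str) -> str:
    """S209: drop the quotation prefix of the next line and append it to the current one."""
    body = nxt[_PREFIX.match(nxt).end():]
    # `cur` keeps its own prefix, so the merged line stays at the hierarchy of Si.
    return cur + body
```

Because the current line's prefix is left untouched, the merged line automatically carries the quotation symbols of the data-length near-maximum line.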

The document shaping unit 32 then determines whether the data Si+1 is the last data (S210). If it is not, i is set to i + 2 (S211) and the process returns to step S204. If the data Si is the last data, the document shaping unit 32 sets J = J + 1 (S212) and determines whether the loop variable J equals the maximum value n of the citation hierarchy (S213). If J ≠ n, i is reset to 1 (S214), the process returns to step S204, and the processing of steps S204 to S214 is repeated.

When step S205 determines that the data Si is not a data length approximate maximum line, or when step S208 determines that its condition is not satisfied, the process proceeds to step S210. When Si+1 cannot be acquired in step S206, the process proceeds to step S212.

When the first pass through steps S204 to S214 is complete, array numbers 1 and 2, array numbers 4 and 5, and array numbers 7 and 8 of the mail body data A2 shown in Table 4 have each been joined, giving the mail body data shown in Table 5.

Repeating steps S204 to S214 until the loop variable J equals the maximum value n of the citation hierarchy gives the mail body data shown in Table 6, in which the citation hierarchy disturbed by automatic line breaks has been put back in order.

In this way, the document shaping unit 32 joins the character string of a data length approximate maximum line with that of the following line when the two are determined to belong to the same hierarchy, prefixes the joined string with the quotation marks that represent the citation hierarchy of the data length approximate maximum line, and stores the result as a new line in the data temporary storage unit 29 of the storage device 25.

Sentence recognition is then performed on the mail body data of Table 6, whose citation hierarchy has been shaped (S215). FIG. 5 is a flowchart showing the sentence recognition processing of the sentence recognition processing unit 33. This processing joins adjacent lines of the same hierarchy in the mail body data, including the lines newly created by the document shaping unit 32, into a character string from which sentences are taken.

To take the data Si out of each line of the shaped mail body data shown in Table 6, the line data designation variable i is first set to its initial value 1, as shown in FIG. 5 (S401).

The i-th data Si is then acquired, its quotation marks are removed, and the result is stored in the sentence recognition processing storage area of the data temporary storage unit 29 (S402). The citation hierarchy Li of the removed quotation marks is stored in the citation hierarchy storage area of the data temporary storage unit 29 (S403). Next, the variable p, a loop counter used to repeat the processing from the i-th position until data of a different citation hierarchy appears, is set to its initial value 1 (S404).

It is then determined whether the citation hierarchy of the data Si+p could be acquired (S405); if not, the processing ends. If it could, it is determined whether the citation hierarchy Li of the data Si and the citation hierarchy of the data Si+p are the same (S406). If they are the same, the quotation marks of the data Si+p are deleted, a line feed code is appended to the end of the data Si+p, and the data is added to the sentence recognition processing storage area of the data temporary storage unit 29 (S407). The variable p is then set to p+1 to acquire the next data (S408), and the process returns to step S405.

If step S406 finds data whose citation hierarchy differs from Li, sentence determination is performed on the data in the sentence recognition processing storage area (S409). At this point, the sentence recognition processing storage area holds mail body data consisting only of data of the same citation hierarchy, as shown in Table 7.

In the sentence determination of step S409, a full stop such as "。" is taken as the end of a sentence, so "メールツールの設定が…になっている。" is cut out first. The text from "全" up to the next "。" is then taken as one sentence, followed by "以上" as one sentence.

When the end is reached, quotation marks are inserted at the head of each sentence (S410). That is, prefixing each sentence with the quotation mark string temporarily stored in the citation hierarchy storage area of the data temporary storage unit 29 gives the mail body data shown in Table 8.

It is then determined whether the data Si is the last data (S411). If the last data has not been reached, i is set to i+p so as to advance i by p (S412), the process returns to step S402, and steps S402 to S411 are repeated starting from the data Si+p. As a result, "はい。" ("Yes.") becomes the target of sentence recognition and is acquired as a sentence unit, as shown in Table 9.

Prefixing "はい。" with the temporarily stored quotation mark string then gives the mail body data shown in Table 10.
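The sentence recognition flow of steps S401 to S412 can be sketched in Python as follows. This is a simplified reading rather than the exact flowchart: it assumes `>` is the quotation mark, drops the line feed codes that S407 appends instead of storing them, splits only on "。", and also processes the final group when no further line is available. `split_quote` and `recognize_sentences` are hypothetical names, and the sample lines in the usage note are invented for illustration.

```python
import re

QUOTE = re.compile(r"(?:>\s?)*")           # assumed quotation marks
SENTENCE = re.compile(r"[^。]*。|[^。]+")    # text up to each full stop, then any remainder


def split_quote(line):
    """Return (quote prefix, citation level, text without the prefix)."""
    prefix = QUOTE.match(line).group(0)
    return prefix, prefix.count(">"), line[len(prefix):]


def recognize_sentences(lines):
    out = []
    i = 0                                                  # S401 (0-based here)
    while i < len(lines):
        prefix, level, text = split_quote(lines[i])        # S402, S403
        p = 1                                              # S404
        while i + p < len(lines):                          # S405
            _, next_level, next_text = split_quote(lines[i + p])
            if next_level != level:                        # S406
                break
            text += next_text                              # S407
            p += 1                                         # S408
        for sentence in SENTENCE.findall(text):            # S409
            out.append(prefix + sentence)                  # S410
        i += p                                             # S412
    return out
```

For example, the three level-2 lines `"> > 設定が変だ。全"`, `"> > 部直した。"`, and `"> > 以上"` followed by `"> はい。"` yield three level-2 sentences and one level-1 sentence, each carrying the quotation marks of its own group.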

According to Example 1 of the embodiment of the present invention, even when a mail tool setting causes automatic line breaks so that text by the same writer ends up in different hierarchies, the hierarchy disturbed by the inserted line breaks is shaped and sentences are recognized within a single hierarchy, which improves sentence recognition accuracy. This in turn improves the translation accuracy of the mail body data to be translated: a translation apparatus must identify sentence units correctly to produce a correct translation, and this example meets that need.

Next, the operation of Example 2 of the document analysis apparatus 11 according to the embodiment of the present invention will be described. FIG. 6 is a flowchart showing that operation. Example 2 adds step S208a, which determines whether the citation hierarchy Li is greater than the length of the data Si+1 when the condition of step S208 in the flowchart of FIG. 3 is determined not to be satisfied.

That is, when the condition of step S208 is not satisfied, meaning that the citation hierarchy Li of the data length approximate maximum line is greater than the citation hierarchy Li+1 of the next line but the length of the joined character string is not within the approximate range of the data length maximum value, the same hierarchy determination unit 31 compares the citation hierarchy Li of the data length approximate maximum line with the length of the character string of the next line. If the citation hierarchy is greater, the unit determines that the data length approximate maximum line and the line that follows it actually belong to the same hierarchy.

The following mail body data A4 is used for the description.

[Mail body data A4]

The mail body data A4 is data in which a line break was explicitly entered at the 11th character and which, over several mail exchanges, was automatically wrapped at 20 characters by the mail tool setting, so that data originally in the same hierarchy ended up in different hierarchies. The processing up to step S208 in FIG. 6 turns the mail body data A4 into the mail body data shown in Table 11.

For the mail body data of Table 11, in step S208 of FIG. 6 the first data S1 with its quotation marks removed, "メールツールの設定", joined with the second data S2, "が", has a length of 20. Step S208 therefore does not judge this length to approximate the data length maximum value MAX, which is 39, and determines that the first data S1 and the second data S2 are not to be joined.

In Example 2, when step S208 of FIG. 6 determines that the data are not to be joined, the length of the second data S2 is compared with the citation hierarchy L1 of the first data S1, and if the length of S2 is smaller than L1 the two are determined to be data of the same hierarchy (S208a). The document shaping unit 32 then joins the data Si and the data Si+1 on the basis of the determination in step S208a.
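The fallback test of step S208a might look like the following Python sketch. It rests on several assumptions: `>` is the quotation mark, length is counted in half-width units with full-width characters counting as 2, and the Li > Li+1 part of the S208 condition still holds when S208a is reached. `fallback_same_hierarchy` and its helpers are hypothetical names, not part of the described apparatus.

```python
import re
import unicodedata

QUOTE = re.compile(r"(?:>\s?)*")  # assumed quotation marks


def split_quote(line):
    """Return (quote prefix, citation level, text without the prefix)."""
    prefix = QUOTE.match(line).group(0)
    return prefix, prefix.count(">"), line[len(prefix):]


def width(text):
    """Length in half-width units; full-width characters count as 2 (assumption)."""
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1 for c in text)


def fallback_same_hierarchy(si, si1):
    """S208a: a very short Si+1 under a deeply quoted Si is treated as a wrapped tail."""
    _, level, _ = split_quote(si)
    _, next_level, next_text = split_quote(si1)
    return level > next_level and width(next_text) < level
```

The rule only fires when the fragment is shorter than the quoting depth, so it targets deeply nested quotations where wrapping pushes only one or two characters onto the next line.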

In the flowchart of FIG. 6, the first pass through steps S204 to S214 gives the mail body data shown in Table 12.

The second pass through steps S204 to S214 gives the mail body data shown in Table 13.

After the second pass through S204 to S214, all of the data is in the same hierarchy, as in the mail body data of Table 13. In step S213 of FIG. 6, therefore, the document shaping unit 32 ends its processing when the variable J reaches the maximum value n of the citation hierarchy or when the processing result is the same as in the previous pass (S213), and the process moves on to sentence recognition (S215).

In the description above, when the condition of step S208 is not satisfied, the same hierarchy determination unit 31 compares the citation hierarchy Li of the data length approximate maximum line with the length of the character string of the next line, and determines that the two lines actually belong to the same hierarchy when the citation hierarchy is greater. Alternatively, the citation hierarchy of the data length approximate maximum line may be compared with a predetermined citation hierarchy threshold, and the two lines determined to belong to the same hierarchy when the citation hierarchy is equal to or greater than that threshold. As a further alternative, the length of the character string of the line following the data length approximate maximum line may be compared with a predetermined number of characters, and the two lines determined to belong to the same hierarchy when that length is equal to or less than the predetermined number.

According to Example 2 of the embodiment of the present invention, in addition to the effect of Example 1, a hierarchy disturbed by automatically inserted line breaks can be shaped even when the citation hierarchy is deep.

Next, another example of the document analysis apparatus 11 according to the embodiment of the present invention will be described. FIG. 7 is a functional block diagram of this example. Its hardware configuration is the same as that shown in FIG. 2.

The control unit 28 controls the entire apparatus. It stores the mail body data sent from the input processing unit 26 in the data temporary storage unit 29 of the storage device 25, controls the line combination determination threshold calculation unit 34, the combined line determination unit 35, and the data shaping unit 36, and sends the contents of the data temporary storage unit 29, which holds their results, to the output processing unit 27 for output to the output device 17.

The line combination determination threshold calculation unit 34 reads the mail body data stored in the storage device 25 line by line, determines the data length of each line from the number of characters in its character string, and stores the data lengths in the storage device. It then calculates, as the line combination determination threshold, the difference between the maximum of the stored data lengths and a data length, other than that maximum, that occurs with high frequency.

The combined line determination unit 35 reads the mail body data stored in the storage device 25 line by line. When the data length of a line is within the approximate range of the maximum value and the difference between its data length and that of the next line is within the approximate range of the line combination determination threshold calculated by the line combination determination threshold calculation unit 34, the unit attaches repetition information, an indicator of line combination, to the line and to the next line and stores it in the storage device 25.

The data shaping unit 36 reads the mail body data stored in the storage device 25 line by line and, when the repetition information attached by the combined line determination unit 35 indicates that lines are to be joined, joins the lines and stores the joined character string in the storage device 25.

Next, the operation of the example of the document analysis apparatus shown in FIG. 7 will be described. FIG. 8 is a flowchart showing that operation.

Assume that the following mail body data A5 is input from the input processing unit 26.

[Mail body data A5]

The mail body data A5 is the mail body data A6 shown below after the mail tool setting has wrapped it at 20 full-width characters.

[Mail body data A6]

When the mail body data A5 is input from the input device 20 to the arithmetic processing unit 12, the control unit 28 starts the input processing unit 26, which reads the mail body data A5 into the data temporary storage unit 29 of the storage device 25, splitting it at each line break (S801). Table 14 shows the mail body data A5 as stored in the input data storage area of the data temporary storage unit 29.

Next, the control unit 28 starts the line combination determination threshold calculation unit 34. This unit reads the mail body data A5 stored in the input data storage area of the data temporary storage unit 29 of the storage device 25 line by line, determines the data length of each line from the number of characters in its character string (S802), and stores the data lengths in the data temporary storage unit 29. It then calculates, as the line combination determination threshold, the difference between the maximum of the stored data lengths and a data length, other than that maximum, that occurs with high frequency (S803).

Specifically, the line combination determination threshold calculation unit 34 reads the mail body data A5 of Table 14 one line at a time, checks the data length of each line, and stores the result in the data temporary storage unit 29 of the storage device 25 as the mail body data shown in Table 15.

From the mail body data of Table 15, the unit then finds the data length maximum value and a frequently occurring data length other than that maximum, and takes the maximum minus the frequently occurring data length as the line combination determination threshold.

In the mail body data of Table 15, the data length maximum value is 40 and the frequently occurring data length is 10, so the line combination determination threshold calculated in step S803 is 30.
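The threshold calculation of steps S802 and S803 can be illustrated with a short Python sketch. It assumes that data length is counted in half-width units, so a full-width character counts as 2, which matches the figure of 40 for a line of 20 full-width characters, and it takes "a frequently occurring data length" to be the single most frequent length other than the maximum. `width` and `line_join_threshold` are hypothetical names.

```python
import unicodedata
from collections import Counter


def width(text):
    """Length in half-width units; full-width characters count as 2 (assumption)."""
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1 for c in text)


def line_join_threshold(lines):
    """S802-S803: maximum length minus the most frequent non-maximum length."""
    lengths = [width(line) for line in lines]      # S802
    if not lengths:
        return None
    max_len = max(lengths)
    others = Counter(n for n in lengths if n != max_len)
    if not others:
        return None                                # every line has the same length
    frequent, _count = others.most_common(1)[0]
    return max_len - frequent                      # S803
```

Applied to the two strings quoted in the description, "メールツールの設定が、「メール送信時に２" (length 40) and "０文字で自" (length 10), the function returns 30.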

Next, the combined line determination unit 35 sets the line data designation variable i to its initial value 1 (S804) and sets the variable Q, which represents the repetition information, to its initial value 1 (S805). The variable i is used to acquire the data Si of each line from the mail body data of Table 15, and the variable Q is used to determine line combination.

The combined line determination unit 35 then acquires the character string of the data Si and its length from the mail body data of Table 15 (S806). Here, "メールツールの設定が、「メール送信時に２" is acquired as the character string of Si and 40 as its length. Whether these data could be acquired is determined (S807); if not, the variable Q is set to 0 (S808), the combined line determination unit 35 stores 0 as the repetition information in the mail body data of Table 16, and the processing ends.

When the character string and length of the data Si have been acquired, the combined line determination unit 35 determines whether the length of Si is within the approximate range of the data length maximum value MAX (S809). The lower bound of the approximate range is MAX minus a correction value, so the condition is MAX − correction value < |length of data Si| < MAX. The correction value is 2 here, for example, but the user may be allowed to set any value.

When the length of the data Si is not within the approximate range of the data length maximum value MAX, the combined line determination unit 35 judges that the line of Si is not subject to repetition, sets the variable Q to 0 (S810), sets i to i+1 (S811), and returns to step S805. In this case, 0 is stored as the repetition information in the mail body data of Table 16.

When step S809 determines that the length of the data Si is within the approximate range of the data length maximum value MAX, the character string and length of the next data Si+1 are acquired (S812). Here, "０文字で自" is acquired as the character string of Si+1 and 10 as its length. Whether these data could be acquired is determined (S813); if not, the variable Q is set to 0 (S808), the combined line determination unit 35 stores 0 as the repetition information in the mail body data of Table 16, and the processing ends.

When the character string and length of the data Si+1 have been acquired, the combined line determination unit 35 calculates the difference between the length of Si and the length of Si+1 (S814) and determines whether that difference is within the approximate range of the line combination determination threshold (S815). Here the difference is 30, which is within the approximate range of the threshold, so the value of the variable Q (Q = 1) is set as the repetition information of both Si and Si+1 in the mail body data of Table 16, and Q is set to Q+1 (S816).

If step S815 determines that the difference is not within the approximate range of the line combination determination threshold, the variable Q is set to 0 (S817) and the combined line determination unit 35 stores 0 as the repetition information in the mail body data of Table 16. It is then determined whether the data Si+1 is the last data (S818). If it is, the combined line determination unit 35 ends its processing and the process moves on to data shaping (S819). If it is not, i is set to i+2 (S820) and the process returns to step S806.
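The repetition-information pass of steps S804 to S820 can be approximated in Python as below. This is a simplified reading of the flowchart, not a faithful transcription. It works on precomputed line lengths and uses a hypothetical tolerance `tol` for both approximate ranges. It includes MAX itself in the range, because the worked example treats a line of length 40 as in range even though the stated upper bound is strict. It restarts the counter Q at 1 after any line or pair that is not marked, whereas the flowchart leaves Q at 0 after a failed comparison. `mark_pairs` is an illustrative name.

```python
def mark_pairs(lengths, max_len, threshold, tol=2):
    """Return one repetition value per line: 0 = not joined, n > 0 = n-th joined pair."""
    flags = [0] * len(lengths)
    i = 0          # S804 (0-based here)
    q = 1          # S805
    while i < len(lengths):                                   # S806, S807
        near_max = max_len - tol < lengths[i] <= max_len      # S809
        if not near_max or i + 1 >= len(lengths):             # S810 / S813
            q = 1
            i += 1                                            # S811
            continue
        diff = lengths[i] - lengths[i + 1]                    # S814
        if abs(diff - threshold) <= tol:                      # S815
            flags[i] = flags[i + 1] = q                       # S816
            q += 1
        else:
            q = 1                                             # S817 (simplified)
        i += 2                                                # S820
    return flags
```

For lengths `[40, 10, 40, 10, 12]` with MAX 40 and threshold 30, the first two pairs are marked 1 and 2 and the trailing line stays 0.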

Data shaping is then performed on the mail body data with repetition information shown in Table 16 (S819). FIG. 9 is a flowchart showing the data shaping performed by the data shaping unit 36. When the repetition information attached to a line indicates that lines are to be joined, this processing joins them into a single character string, thereby shaping the data.

First, the line data designation variable i, used to access the mail body data with repetition information shown in Table 16, is set to 1 and the repetition state variable K is set to 0 (S901).

The character string of the data Si and its repetition information are then acquired from the mail body data shown in Table 16 (S902). Whether these data could be acquired is determined (S903); if not, the data shaping unit 36 ends its processing. If they could, it is determined whether the repetition information is 0 (S904).

If the repetition information is 0, the line is not joined to the next line, so a line break is appended to the data Si and the shaped data Si is stored (S905) in the shaping result storage area of the data temporary storage unit 29 of the storage device 25. The variable i is then set to i+1 (S906), and the process returns to step S902.

If the repetition information is not 0, the character string and repetition information of the next data Si+1 are acquired (S907). Whether these data could be acquired is determined (S908); if not, the data shaping unit 36 ends its processing.

When the character string and repetition information of the data Si+1 have been acquired, the repetition information of the data Si+2 that follows Si+1 is acquired (S909). Whether it could be acquired is determined (S910), and if so, whether it is 0 is determined (S911).

If step S910 determines that the repetition information of Si+2 could not be acquired, or step S911 determines that it is 0, it is determined whether the variable K is 1 (S912). If K is 1, the character strings of Si and Si+1 are joined, a line break is appended, and the result is stored in the shaping result storage area of the data temporary storage unit 29 of the storage device 25; that is, Si and Si+1 are joined and stored (S913). If K is not 1, a line break is appended to each of Si and Si+1 and they are stored in the shaping result storage area; that is, Si and Si+1 are stored separately (S914). Because the repetition state has ended, K is then reset to 0 (S915).

On the other hand, if step S911 finds that the repetition information of data Si+2 is other than “0”, data Si, Si+1, Si+2, and Si+3 can be judged to be lines that continue according to a fixed rule. The character strings of data Si and data Si+1 are therefore joined, a line feed is appended to the end, and the result is stored in the shaping result storage area of the data temporary storage unit 29 of the storage device 25. That is, Si and Si+1 are stored as one combined line (S916). Because the repeating state holds, the variable K is set to “1” (S917). It is then determined whether data Si+1 is the last data (S918). If it is, the process ends; if it is not, 2 is added to the variable i so that i = i + 2 (S919), and the process returns to S902.
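The branch logic of steps S907 to S919 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the data shaping unit 36: the function name `shape_lines`, the list-based inputs, and the handling of a line whose repetition information is 0 (a branch not described in this passage) are all hypothetical. The sketch also keeps a trailing line when no following line exists, whereas the description simply ends the process at S908, and it assumes that control passes to S918 after S915.

```python
def shape_lines(lines, repetition):
    """Join alternating long/short lines that form a repeating pattern.

    lines      -- line strings without trailing line feeds (assumed input form)
    repetition -- one integer per line; 0 means "no repetition detected"
                  (how this value is computed is outside this sketch)
    Returns the shaped lines; the caller appends line feeds when storing them.
    """
    shaped = []
    k = 0          # variable K: 1 while a repeating run is in progress
    i = 0
    n = len(lines)
    while i < n:
        if repetition[i] == 0:
            # Assumption: a non-repeating line is stored unchanged.
            shaped.append(lines[i])
            i += 1
            continue
        if i + 1 >= n:
            # S908: S(i+1) cannot be acquired, so processing ends.
            # Assumption: keep S(i) rather than drop it.
            shaped.append(lines[i])
            break
        # S909/S910: repetition information of S(i+2), if it exists.
        next_rep = repetition[i + 2] if i + 2 < n else None
        if next_rep is None or next_rep == 0:      # S910 / S911
            if k == 1:                             # S912
                shaped.append(lines[i] + lines[i + 1])    # S913: join
            else:
                shaped.extend([lines[i], lines[i + 1]])   # S914: keep apart
            k = 0                                  # S915: run has ended
        else:
            shaped.append(lines[i] + lines[i + 1])        # S916: join
            k = 1                                  # S917: run continues
        i += 2                                     # S918 / S919
    return shaped


if __name__ == "__main__":
    body = ["long line A ", "short a", "long line B ", "short b", "tail"]
    print(shape_lines(body, [1, 1, 1, 1, 0]))
```

Note that a single long/short pair that is not followed by another such pair is left unjoined, because K is still 0 when S912 is reached; only a pattern that actually repeats triggers joining. The strings are concatenated directly, as in the description; text in a language that separates words with spaces would need a separator.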

When the processing of steps S901 to S919 is complete, the shaping result storage area of the data temporary storage unit 29 of the storage device 25 holds mail body data such as that shown in Table 17. As Table 17 shows, the character strings have been shaped.

Thus, even when the mail tool's settings cause line breaks to be inserted automatically in the middle of sentences, producing an alternating pattern of long line, short line, long line, short line, the sentences can be restored by performing the data shaping process.

According to the embodiment of the present invention, sentence recognition is performed within the same hierarchy after the hierarchy disturbed by automatically inserted line feeds has been shaped, so that sentence recognition accuracy can be improved.
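The same-hierarchy determination and shaping recited in the claims can be illustrated with a short Python sketch. It is only a sketch under assumptions: the quotation mark is taken to be a leading `>`, "within a predetermined range of the maximum value" is read as an absolute difference of at most `tolerance` characters, the data length is counted with the quotation marks included, and the names `quote_depth`, `strip_quotes`, and `merge_same_hierarchy` are hypothetical.

```python
import re

# Assumption: quotation marks are leading '>' characters, optionally spaced.
QUOTE_PREFIX = re.compile(r"^(?:\s*>)+\s?")


def quote_depth(line):
    """Citation hierarchy of a line = number of leading quotation marks."""
    match = QUOTE_PREFIX.match(line)
    return match.group(0).count(">") if match else 0


def strip_quotes(line):
    """Remove the line feed and the leading quotation marks from a line."""
    return QUOTE_PREFIX.sub("", line.rstrip("\n"))


def merge_same_hierarchy(lines, tolerance=4):
    """Join a near-maximum-length line with a wrapped remainder that follows it."""
    lengths = [len(line.rstrip("\n")) for line in lines]
    max_len = max(lengths, default=0)

    def near_max(length):
        # Assumed reading of "within a predetermined range of the maximum".
        return abs(max_len - length) <= tolerance

    shaped = []
    i = 0
    while i < len(lines):
        current = lines[i].rstrip("\n")
        if (i + 1 < len(lines)
                and near_max(lengths[i])
                and quote_depth(current) > quote_depth(lines[i + 1])):
            combined = strip_quotes(current) + strip_quotes(lines[i + 1])
            if near_max(len(combined)):
                # Same hierarchy: drop the next line's quotation marks and
                # join its text to the end of the near-maximum line.
                shaped.append(current + strip_quotes(lines[i + 1]))
                i += 2
                continue
        shaped.append(current)
        i += 1
    return shaped


if __name__ == "__main__":
    mail = [">> abcdefghijklmnopqr", "> st", "reply"]
    print(merge_same_hierarchy(mail))
```

The fallback tests of the dependent claims (comparison against the next line's length, a citation hierarchy threshold, or a character-count threshold) are left out of this sketch.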

Functional block diagram of the document analysis apparatus according to the embodiment of the present invention.
Block diagram showing the hardware configuration of the document analysis apparatus according to the embodiment of the present invention.
Flowchart showing the operation of Example 1 of the document analysis apparatus according to the embodiment of the present invention.
Flowchart showing the document hierarchy determination processing of the document hierarchy determination unit in the embodiment of the present invention.
Flowchart showing the sentence recognition processing of the sentence recognition processing unit in the embodiment of the present invention.
Flowchart showing the operation of Example 2 of the document analysis apparatus according to the embodiment of the present invention.
Functional block diagram of another example of the document analysis apparatus according to the embodiment of the present invention.
Flowchart showing the operation of another example of the document analysis apparatus according to the embodiment of the present invention.
Flowchart showing the data shaping processing of the data shaping unit in the embodiment of the present invention.

Explanation of symbols

11 ... document analysis apparatus, 12 ... arithmetic control device, 13 ... processor, 14 ... memory, 15 ... document analysis program, 16 ... work area, 17 ... output device, 18 ... display device, 19 ... communication control device, 20 ... input device, 21 ... mouse, 22 ... keyboard, 23 ... disk drive, 24 ... hard disk drive, 25 ... storage device, 26 ... input processing unit, 27 ... output processing unit, 28 ... control unit, 29 ... data temporary storage unit, 30 ... document hierarchy determination unit, 31 ... same hierarchy determination unit, 32 ... document shaping unit, 33 ... sentence recognition processing unit, 34 ... line combination determination threshold calculation unit, 35 ... combined line determination unit, 36 ... data shaping unit

Claims

A document analysis apparatus comprising:
a document hierarchy determination unit that, for each line character string obtained by reading the body data of a mail input from an input device line feed by line feed and stored in a storage device, determines a data length for each line based on the number of characters in the line character string, and determines a citation hierarchy for each line based on the number of quotation marks in the line character string;
a same hierarchy determination unit that, when the citation hierarchy of a data length approximate maximum line, whose data length is within a predetermined range of the maximum value among the data lengths determined by the document hierarchy determination unit, is larger than the citation hierarchy of the next line, determines whether the length of a combined character string, formed from the character string of the data length approximate maximum line with its quotation marks and line feed removed and the character string of the next line with its quotation marks and line feed removed, is within the predetermined range of the maximum value of the data length, and, when the length of the combined character string is within the predetermined range of the maximum value of the data length, determines that the data length approximate maximum line and the next line are actually character strings of the same hierarchy; and
a document shaping unit that, for the character strings of the data length approximate maximum line and the next line determined by the same hierarchy determination unit to be of the same hierarchy among the line character strings stored in the storage device, removes the quotation marks from the line next to the data length approximate maximum line, shapes the text by joining that line to the end of the data length approximate maximum line, and outputs all the line character strings to an output device.

The document analysis apparatus according to claim 1, wherein, when the length of the combined character string is not within the predetermined range of the maximum value of the data length, the same hierarchy determination unit compares the citation hierarchy of the data length approximate maximum line with the length of the character string of the next line, and, when the citation hierarchy of the data length approximate maximum line is larger than the length of the character string of the next line, determines that the data length approximate maximum line and the next line are actually character strings of the same hierarchy.

The document analysis apparatus according to claim 1, wherein, when the length of the combined character string is not within the predetermined range of the maximum value of the data length, the same hierarchy determination unit compares the citation hierarchy of the data length approximate maximum line with a predetermined citation hierarchy threshold, and, when the citation hierarchy of the data length approximate maximum line is equal to or greater than the predetermined citation hierarchy threshold, determines that the data length approximate maximum line and the next line are actually character strings of the same hierarchy.

The document analysis apparatus according to claim 1, wherein, when the length of the combined character string is not within the predetermined range of the maximum value of the data length, the same hierarchy determination unit compares the length of the character string of the line next to the data length approximate maximum line with a predetermined number of characters, and, when the length of the character string of that next line is equal to or less than the predetermined number of characters, determines that the data length approximate maximum line and the next line are actually character strings of the same hierarchy.

A document analysis program used in a document analysis apparatus constituted by a computer having a storage device that stores the document analysis program, an input device that inputs mail body data, an arithmetic control device that executes the document analysis program stored in the storage device, and an output device that outputs the computation results of the arithmetic control device, the program causing the computer to execute:
a procedure of determining, for each line character string obtained by reading the body data of the mail input from the input device line feed by line feed and stored in the storage device, a data length for each line based on the number of characters in the line character string, and determining a citation hierarchy for each line based on the number of quotation marks in the line character string;
a procedure of determining, when the citation hierarchy of a data length approximate maximum line, whose data length is within a predetermined range of the maximum value among the determined data lengths, is larger than the citation hierarchy of the next line, whether the length of a combined character string, formed from the character string of the data length approximate maximum line with its quotation marks and line feed removed and the character string of the next line with its quotation marks and line feed removed, is within the predetermined range of the maximum value of the data length;
a procedure of determining, when the length of the combined character string is within the predetermined range of the maximum value of the data length, that the data length approximate maximum line and the next line are actually character strings of the same hierarchy; and
a procedure of removing, for the character strings of the data length approximate maximum line and the next line determined to be of the same hierarchy among the line character strings stored in the storage device, the quotation marks from the line next to the data length approximate maximum line, shaping the text by joining that line to the end of the data length approximate maximum line, and outputting all the line character strings to the output device.

A document analysis method comprising:
determining, for each line character string obtained by reading the body data of a mail input from an input device line feed by line feed and stored in a storage device, a data length for each line based on the number of characters in the line character string, and determining a citation hierarchy for each line based on the number of quotation marks in the line character string;
determining, when the citation hierarchy of a data length approximate maximum line, whose data length is within a predetermined range of the maximum value among the determined data lengths, is larger than the citation hierarchy of the next line, whether the length of a combined character string, formed from the character string of the data length approximate maximum line with its quotation marks and line feed removed and the character string of the next line with its quotation marks and line feed removed, is within the predetermined range of the maximum value of the data length;
determining, when the length of the combined character string is within the predetermined range of the maximum value of the data length, that the data length approximate maximum line and the next line are actually character strings of the same hierarchy; and
removing, for the character strings of the data length approximate maximum line and the next line determined to be of the same hierarchy among the line character strings stored in the storage device, the quotation marks from the line next to the data length approximate maximum line, shaping the text by joining that line to the end of the data length approximate maximum line, and outputting all the line character strings to an output device.