JP2006202004A

JP2006202004A - Morpheme analysis device and its method

Info

Publication number: JP2006202004A
Application number: JP2005012566A
Authority: JP
Inventors: Hisako Asano; 久子浅野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-01-20
Filing date: 2005-01-20
Publication date: 2006-08-03

Abstract

PROBLEM TO BE SOLVED: To provide a morphological analysis device and method capable of efficiently and correctly analyzing a Japanese sentence that contains words (character strings) made up of alphabetic characters including uppercase letters.
SOLUTION: A word chain generation means 6 uses a word dictionary 3 and grammar rules 4 to extract, from an input text, words that satisfy a connectable part-of-speech chain. For a character string made up of alphabetic characters including uppercase letters whose notation is not registered in the word dictionary 3 (an unknown character string), it converts the uppercase letters in the string to lowercase and searches the word dictionary 3 again to extract a word, and then prepares a word chain candidate sequence. A word selection means 7 performs a predetermined cost calculation on the word chain candidate sequence, and selects and outputs a word information sequence consisting of a series of word chains covering the input text from beginning to end.
COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a morphological analysis technique that divides an input Japanese sentence (Japanese text) into words and outputs a word information sequence in which information such as the reading and part of speech is attached to each word.

A common approach to morphological analysis of Japanese sentences works as follows (see Non-Patent Documents 1 and 2). Using a word dictionary and grammar rules, words that satisfy a connectable part-of-speech chain are extracted (identified) from the input Japanese sentence. The extracted words, together with information such as their readings and parts of speech, are combined so as to cover the sentence from beginning to end, yielding a word chain candidate sequence, that is, a set of candidate word chains (word chain generation processing). A predetermined cost calculation is then performed on the word chain candidate sequence, and at least one word information sequence consisting of a series of word chains covering the sentence from beginning to end is selected and output (word selection processing).

In this approach, each extracted word is normally either a dictionary-registered word, that is, a word registered in the word dictionary, or an unknown word that is not registered in the word dictionary (used when no part-of-speech chain can be generated from dictionary-registered words).

Non-Patent Document 1: Kenji Yoshimura et al., “Morphological analysis of unsegmented Japanese sentences using the least-bunsetsu-number method”, Transactions of the Information Processing Society of Japan, Vol. 24, No. 1, 1983, pp. 40-46
Non-Patent Document 2: Toru Hisamitsu et al., “A proposal of morphological analysis by the minimum connection cost method and an evaluation of its computational complexity”, IEICE Technical Report, Vol. 90, No. 116, 1990, pp. 17-24

Thus, conventional word chain generation processing treats any word (character string) whose notation is not registered in the word dictionary as an unknown word. As a result, even when an ordinary word written in lowercase letters (an English word) is registered in the word dictionary, a form whose first letter is capitalized at the beginning of a sentence, or a form written entirely in capitals for emphasis or the like, has a different notation and is therefore treated as an unknown word. For example, even if the word “restaurant” is registered in the word dictionary, “Restaurant” (first letter capitalized) and “RESTAURANT” (all letters capitalized) are treated as unknown words unless they are also registered.

This problem could be solved by registering “Restaurant” and “RESTAURANT” in the dictionary along with “restaurant”, but doing so creates new problems. The word dictionary grows accordingly, and whenever the dictionary information for “restaurant” is corrected, the entries for “Restaurant” and “RESTAURANT” must be corrected as well. The work and processing needed to maintain (correct and update) the word dictionary increase, which raises costs.

Alternatively, converting all alphabetic characters in the input Japanese sentence to lowercase before morphological analysis would handle cases such as “Restaurant” and “RESTAURANT”. However, this makes it difficult to distinguish words that are always written in capitals, such as proper nouns and abbreviations, and that are normally registered in the dictionary separately from the identically spelled lowercase word. For example, “IT” usually stands for the abbreviation of “Information Technology” and is read “アイティー” (ai-tii), whereas “it” is usually read “イット” (itto). In addition, because the analysis is performed on a modified version of the input sentence, whenever the original notation is needed the original sentence must be saved and substituted back after the analysis. This requires more memory and more processing, which again raises costs.
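The trade-off described above can be illustrated with a minimal Python sketch. The dictionary contents, the function names, and the reading given for “restaurant” are assumptions made for illustration; only the readings of “IT” and “it” come from the text.

```python
# Hypothetical toy dictionary: notation -> (reading, part of speech).
WORD_DICT = {
    "it": ("イット", "noun"),
    "IT": ("アイティー", "noun"),
    "restaurant": ("レストラン", "noun"),  # reading assumed for illustration
}


def lookup_exact(token):
    """Look the token up exactly as written (conventional behavior)."""
    return WORD_DICT.get(token)


def lookup_after_global_lowercasing(token):
    """Lowercase everything first, then look up (the alternative criticized above)."""
    return WORD_DICT.get(token.lower())


# Exact lookup cannot find the capitalized form of a registered word.
print(lookup_exact("RESTAURANT"))
# Global lowercasing finds it, but it also collapses "IT" onto the entry for "it".
print(lookup_after_global_lowercasing("RESTAURANT"))
print(lookup_exact("IT"), lookup_after_global_lowercasing("IT"))
```

Neither lookup alone gives the desired behavior, which is the gap the invention addresses by lowercasing only after an exact lookup has failed.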

The present invention has been made in view of the above, and its object is to provide a morphological analysis device and method capable of efficiently and correctly analyzing a Japanese sentence that contains words (character strings) made up of alphabetic characters including uppercase letters.

The present invention concerns morphological analysis in which a word chain candidate sequence is generated from an input Japanese sentence using a word dictionary and grammar rules, a cost calculation is performed on the word chain candidate sequence, and a word information sequence is output. It is characterized in that, when the word chain candidate sequence is generated, for an unknown character string that is made up of alphabetic characters including uppercase letters and whose notation is not registered in the word dictionary, the uppercase letters in the string are converted to lowercase and the word dictionary is searched again.

According to the present invention, a Japanese sentence containing capitalized English words can be morphologically analyzed efficiently and correctly.

The present invention is described in detail below with reference to the illustrated embodiment.

FIG. 1 shows a morphological analysis device according to an embodiment of the present invention. In the figure, 1 is a first storage means, 2 is a second storage means, 3 is a word dictionary, 4 is grammar rules, and 5 is a central processing unit (CPU).

The first storage means 1 stores a Japanese sentence (hereinafter, the input text) that is entered directly from a keyboard or the like (not shown), read from a storage medium, or received from another device or the like via a communication medium.

The second storage means 2 stores the word chain candidate sequence generated by the word chain generation means described later.

The word dictionary 3 registers a large number of words, each consisting of at least one character, together with information such as their readings and parts of speech. The grammar rules 4 describe, for chains between the various parts of speech, whether connection is possible or not. Both are in practice stored and held in a storage device (not shown).

The CPU 5 controls the above components according to the program shown in the flowcharts of FIGS. 2 to 4, and in doing so constitutes the word chain generation means 6 and the word selection means 7.

The word chain generation means 6 operates in one of two ways. In the first, it reads the input text from the first storage means 1 and, using the word dictionary 3 and the grammar rules 4, extracts words that satisfy a connectable part-of-speech chain from the input text. During this extraction, for an unknown character string that is made up of alphabetic characters including uppercase letters and whose notation is not registered in the word dictionary 3, it converts the uppercase letters in the string to lowercase and searches the word dictionary 3 again to extract a word. It then generates a word chain candidate sequence by combining the extracted words, together with information such as their readings and parts of speech, so as to cover the input text from beginning to end, and stores the sequence in the second storage means 2.

In the second, it reads the input text from the first storage means 1, extracts words that satisfy a connectable part-of-speech chain using the word dictionary 3 and the grammar rules 4, and generates a word chain candidate sequence by combining the extracted words, together with information such as their readings and parts of speech, so as to cover the input text from beginning to end. If the word chain candidate sequence contains an unknown character string that is made up of alphabetic characters including uppercase letters and whose notation is not registered in the word dictionary 3, it converts the uppercase letters in the string to lowercase and searches the word dictionary 3 again. When a corresponding word exists, it replaces the unknown alphabetic string with that word. It then stores the sequence in the second storage means 2.

The word selection means 7 reads the word chain candidate sequence from the second storage means 2 and performs a predetermined cost calculation on it. Any method may be used, for example the existing least-bunsetsu-number method or minimum-cost method. Of the series of word chains covering the input text from beginning to end, the one with the highest priority is output as the word information sequence. A specified number of word information sequences (alternative analyses) may instead be output in order of priority.
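As a rough illustration of the word selection step, the following Python sketch ranks candidate word chains by a simple additive cost and returns the best few. It is not the least-bunsetsu-number method or the minimum-cost method themselves; the `Word` type, the per-word costs, and the function names are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    surface: str  # notation as it appears in the input text
    pos: str      # part of speech
    cost: int     # hypothetical per-word cost (unknown words cost more)


def chain_cost(chain: List[Word]) -> int:
    """Total cost of one candidate chain (a plain sum in this sketch)."""
    return sum(word.cost for word in chain)


def select_chains(candidates: List[List[Word]], n_best: int = 1) -> List[List[Word]]:
    """Return the n_best lowest-cost chains, best first."""
    return sorted(candidates, key=chain_cost)[:n_best]


known = [Word("Restaurant", "noun", 1), Word(" ", "blank", 1), Word("TOKYO", "noun", 1)]
unknown = [Word("Restaurant", "unknown", 10), Word(" ", "blank", 1), Word("TOKYO", "unknown", 10)]

best = select_chains([unknown, known])
print(best[0])
```

Passing a larger `n_best` corresponds to outputting several word information sequences in order of priority.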

The morphological analysis method performed by the device of the present invention is described below. First, the overall flow of morphological analysis is described with reference to FIG. 2.

First, when the input text is entered directly from a keyboard or the like (not shown), read from a storage medium, or received from another device or the like via a communication medium, the CPU 5 stores it in the first storage means 1 (s1).

Next, the CPU 5, through its word chain generation means 6, reads the contents of the first storage means 1 (s2), performs the word chain generation processing described later using the word dictionary 3 and the grammar rules 4 to create a word chain candidate sequence (s3), and stores it in the second storage means 2 (s4).

Finally, the CPU 5, through its word selection means 7, reads the word chain candidate sequence from the second storage means 2 (s5), performs the same word selection processing as in the conventional case, outputs a word information sequence (s6), and ends the processing.

Next, the word chain generation processing performed by the word chain generation means 6 is described in detail with reference to FIGS. 3 and 4. Any dictionary search and connection check algorithm may be used as the predetermined algorithm in the processing below; as shown in Non-Patent Documents 1 and 2, various existing techniques are available. Here, “dictionary search” means a search of the word dictionary 3, and “connection check” means using the grammar rules 4 to check whether words can be connected and at what connection cost.
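A connection check of this kind can be sketched as a table lookup over part-of-speech pairs. The table contents, the cost values, and the function names below are assumptions for illustration; the text does not specify how the grammar rules 4 are represented.

```python
from typing import Dict, Optional, Tuple

# Hypothetical grammar rules: (left part of speech, right part of speech) -> connection cost.
# A pair that is absent from the table is treated as not connectable.
GRAMMAR_RULES: Dict[Tuple[str, str], int] = {
    ("bracket", "noun"): 1,
    ("noun", "blank"): 1,
    ("blank", "noun"): 1,
    ("noun", "bracket"): 1,
    ("noun", "noun"): 2,
}


def can_connect(left_pos: str, right_pos: str) -> bool:
    """Connection check: may a word of right_pos follow a word of left_pos?"""
    return (left_pos, right_pos) in GRAMMAR_RULES


def connection_cost(left_pos: str, right_pos: str) -> Optional[int]:
    """Connection cost for the pair, or None when the pair is not connectable."""
    return GRAMMAR_RULES.get((left_pos, right_pos))


print(can_connect("bracket", "noun"), connection_cost("noun", "noun"))
print(can_connect("blank", "blank"), connection_cost("blank", "blank"))
```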

FIG. 3 shows an example of the detailed flow of the word chain generation processing in FIG. 2.

In step s11, the initial value for the predetermined dictionary search and connection check algorithm is set (the beginning of the sentence for an algorithm that processes from the beginning, the end of the sentence for one that processes from the end), and the process proceeds to step s12.

In step s12, a word dictionary search is performed starting from the target character position using the predetermined dictionary search and connection check algorithm, and the process proceeds to step s13.

In step s13, it is determined whether a matching word exists in the word dictionary 3. If so, the process proceeds to step s19; otherwise it proceeds to step s14.

In step s14, it is determined whether the first character at the target character position is an alphabetic character. If so, the process proceeds to step s15; otherwise it proceeds to step s20. When the predetermined dictionary search and connection check algorithm processes from the beginning of the sentence toward the end, the determination may be restricted to whether the first character is an uppercase alphabetic character. In that case, the determination in step s16 described later is skipped and the process moves on directly to step s17.

In step s15, the run of consecutive alphabetic characters starting at the head of the target character string is extracted, and the process proceeds to step s16.

In step s16, it is determined whether the extracted string of consecutive alphabetic characters (the alphabetic string) contains an uppercase letter. If so, the process proceeds to step s17; otherwise it proceeds to step s20.

In step s17, the extracted alphabetic string is converted to lowercase, a word dictionary search is performed on the result using the predetermined dictionary search and connection check algorithm, and the process proceeds to step s18. Two kinds of lowercase conversion are possible: converting every character to lowercase (full lowercasing), and leaving only the first character (on the sentence-head side) in uppercase while converting all the others to lowercase (lowercasing except the first character). Both forms may be searched. Alternatively, an order of application may be fixed in advance (for example, lowercasing except the first character, then full lowercasing), and the next form (here, full lowercasing) searched only if the first form (here, lowercasing except the first character) does not match. Or only one of the two forms may be searched.
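The ordered variant of step s17 might be sketched in Python as follows. The function names, the `order` parameter, and the dictionary shape (notation mapped to part of speech) are assumptions for illustration, and the lookup is an exact match on the whole string rather than a full dictionary search and connection check algorithm.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def case_variants(token: str, order: Sequence[str] = ("keep_first", "all_lower")) -> List[str]:
    """Lowercased forms of token, in the order in which they should be searched."""
    forms = {
        "keep_first": token[:1] + token[1:].lower(),  # lowercasing except the first character
        "all_lower": token.lower(),                   # full lowercasing
    }
    variants: List[str] = []
    for kind in order:
        candidate = forms[kind]
        # Skip forms identical to the original notation (already searched) and duplicates.
        if candidate != token and candidate not in variants:
            variants.append(candidate)
    return variants


def relookup(token: str, dictionary: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Try each variant in order and stop at the first one registered in the dictionary."""
    for variant in case_variants(token):
        if variant in dictionary:
            return variant, dictionary[variant]
    return None


print(case_variants("TOKYO"))
print(case_variants("Restaurant"))
print(relookup("TOKYO", {"Tokyo": "noun"}))
```

Because a form equal to the original notation is skipped, a string with only its first letter capitalized yields just the fully lowercased form, while an all-uppercase string yields both forms in the configured order.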

In step s18, it is determined whether a matching word exists in the word dictionary 3. If so, the process proceeds to step s19; otherwise it proceeds to step s20.

In step s19, the word found in the word dictionary 3 (the dictionary search word) is added to the word chain candidate sequence. If the match was obtained with the lowercased alphabetic string, the notation is replaced with the original alphabetic string containing uppercase letters. The target character position is then updated according to the predetermined dictionary search and connection check algorithm, and the process proceeds to step s21.

In step s20, the alphabetic string described above is added to the word chain candidate sequence as an unknown word, the target character position is updated according to the predetermined dictionary search and connection check algorithm, and the process proceeds to step s21.

In step s21, it is determined whether processing has been completed for the entire character string. If so, the processing ends; otherwise the process returns to step s12.

Some dictionary search and connection check algorithms register a word representing an unknown word (usually one character) in the word dictionary 3 so that unknown words can be identified. With such an algorithm, when only that unknown-word entry is retrieved from the word dictionary 3 in the determination of steps s13 and s18, the condition may be judged as not satisfied.
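The flow of FIG. 3 can be approximated by the Python sketch below. It rests on several simplifying assumptions that are not in the text: the dictionary search is a greedy longest match from the beginning of the sentence, no connection check is performed, only ASCII letters count as alphabetic characters, the re-search is an exact match on the whole alphabetic string, and a non-alphabetic character with no dictionary match becomes a one-character unknown word. All names are hypothetical.

```python
import string
from typing import Dict, List, Optional, Tuple

LETTERS = set(string.ascii_letters)


def longest_match(text: str, pos: int, dictionary: Dict[str, str]) -> Optional[str]:
    """s12: longest dictionary entry starting at pos, or None."""
    for end in range(len(text), pos, -1):
        if text[pos:end] in dictionary:
            return text[pos:end]
    return None


def alphabet_run(text: str, pos: int) -> str:
    """s15: run of consecutive alphabetic characters starting at pos (may be empty)."""
    end = pos
    while end < len(text) and text[end] in LETTERS:
        end += 1
    return text[pos:end]


def case_variants(token: str) -> List[str]:
    """s17: lowercasing except the first character, then full lowercasing."""
    variants: List[str] = []
    for candidate in (token[:1] + token[1:].lower(), token.lower()):
        if candidate != token and candidate not in variants:
            variants.append(candidate)
    return variants


def generate_word_chain(text: str, dictionary: Dict[str, str]) -> List[Tuple[str, str]]:
    chain: List[Tuple[str, str]] = []
    pos = 0                                             # s11: start at the sentence head
    while pos < len(text):                              # s21: until all text is processed
        surface = longest_match(text, pos, dictionary)  # s12
        if surface is not None:                         # s13
            chain.append((surface, dictionary[surface]))  # s19
            pos += len(surface)
            continue
        run = alphabet_run(text, pos)                   # s14, s15
        found = None
        if any(ch.isupper() for ch in run):             # s16
            found = next((dictionary[v] for v in case_variants(run) if v in dictionary), None)  # s17, s18
        unit = run or text[pos]
        # s19 keeps the original notation; s20 records an unknown word.
        chain.append((unit, found or "unknown"))
        pos += len(unit)
    return chain


toy_dict = {"restaurant": "noun", "Tokyo": "noun", " ": "blank"}
print(generate_word_chain("Restaurant TOKYO", toy_dict))
```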

FIG. 4 shows another example of the detailed flow of the word chain generation processing in FIG. 2.

The processing in steps s11, s12, s13, s19, s20, and s21 is the same as in the example of FIG. 3. These steps generate a word chain candidate sequence containing unknown words, as in the prior art.

In step s31, it is determined whether the word chain candidate sequence generated in steps s11 to s21 contains an unknown character string (unknown word) made up of alphabetic characters including uppercase letters that has not yet undergone the processing of step s32 onward. If so, the process proceeds to step s32; otherwise the processing ends.

In step s32, the alphabetic string is converted to lowercase and a word dictionary search is performed, in the same way as in step s17. The connection check, however, is performed against the words that chain with this word in the word chain candidate sequence already created. The process then proceeds to step s33.

In step s33, it is determined whether a matching word exists in the word dictionary 3. If so, the process proceeds to step s34; otherwise it returns to step s31.

In step s34, the unknown word in the word chain candidate sequence is replaced with the word found in the word dictionary 3 (the dictionary search word). The notation, however, is kept as the original alphabetic string containing uppercase letters. The process then returns to step s31.
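Steps s31 to s34 amount to a post-processing pass over a chain that already contains unknown words. The Python sketch below is one way to picture it, under assumptions not stated in the text: the chain is a flat list of (notation, part of speech) pairs, unknown words carry the part of speech "unknown", sentence boundaries are represented by the hypothetical labels "BOS" and "EOS", and the connection check is a caller-supplied predicate.

```python
from typing import Callable, Dict, List, Tuple

Entry = Tuple[str, str]  # (notation, part of speech); "unknown" marks an unknown word


def case_variants(token: str) -> List[str]:
    """Lowercasing except the first character, then full lowercasing."""
    variants: List[str] = []
    for candidate in (token[:1] + token[1:].lower(), token.lower()):
        if candidate != token and candidate not in variants:
            variants.append(candidate)
    return variants


def resolve_unknowns(
    chain: List[Entry],
    dictionary: Dict[str, str],
    can_connect: Callable[[str, str], bool] = lambda left, right: True,
) -> List[Entry]:
    resolved = list(chain)
    for i, (surface, pos) in enumerate(resolved):
        # s31: only unknown alphabetic strings containing an uppercase letter qualify.
        is_target = (
            pos == "unknown"
            and surface.isascii()
            and surface.isalpha()
            and any(ch.isupper() for ch in surface)
        )
        if not is_target:
            continue
        left = resolved[i - 1][1] if i > 0 else "BOS"
        right = resolved[i + 1][1] if i + 1 < len(resolved) else "EOS"
        for variant in case_variants(surface):              # s32: lowercase and re-search
            found = dictionary.get(variant)
            # s33: the entry must exist and connect with the neighbors already in the chain.
            if found and can_connect(left, found) and can_connect(found, right):
                resolved[i] = (surface, found)               # s34: keep the original notation
                break
    return resolved


allowed = {("bracket", "noun"), ("noun", "blank"), ("blank", "noun"), ("noun", "bracket")}
before = [("「", "bracket"), ("Restaurant", "unknown"), (" ", "blank"),
          ("TOKYO", "unknown"), ("」", "bracket")]
after = resolve_unknowns(before, {"restaurant": "noun", "Tokyo": "noun"},
                         lambda left, right: (left, right) in allowed)
print(after)
```

Compared with the inline re-search of FIG. 3, this arrangement leaves the conventional chain generation untouched and confines the case handling to a separate pass.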

Next, specific processing examples of morphological analysis according to the present invention are described. The following examples use the input text
「Restaurant TOKYO」本日オープン
(“Restaurant TOKYO” opens today), and assume that the words “restaurant” and “Tokyo” are registered in the word dictionary 3 while “Restaurant” and “TOKYO” are not.

The dictionary search and connection check algorithm is assumed to process from the beginning of the sentence toward the end. For the lowercase conversion in step s17 of FIG. 3 and step s32 of FIG. 4, when the target alphabetic string is entirely uppercase, lowercasing except the first character and then full lowercasing are applied in that order; when only the first character is uppercase, only full lowercasing is applied.

<Processing Example 1>
FIG. 5 shows an example in which the processing described with FIG. 3 is used as the word chain generation processing. The representative steps are described below.

First, in step s11 the word chain generation means 6 sets the beginning of the sentence as the target character position. In step s12, a dictionary search is performed and “「” (part of speech: bracket) matches. The process therefore moves from step s13 to step s19, where “「” (part of speech: bracket) is added to the word chain candidate sequence and the target character position is advanced by one character.

Next, the process moves from step s21 to step s12, a dictionary search is performed, and the process proceeds to step s13. Because no connectable word exists in the word dictionary 3 here, the process proceeds to step s14.

In step s14, the first character of the target character string is “R”, which is an alphabetic character, so the process proceeds to step s15.

In step s15, the alphabetic string “Restaurant”, which runs on from “R”, is extracted, and the process proceeds to step s16.

In step s16, “R” is an uppercase letter, so the process proceeds to step s17.

In step s17, only the first character of “Restaurant” is uppercase, so a dictionary search is performed on the fully lowercased string “restaurant”, and the process proceeds to step s18.

In step s18, “restaurant” (part of speech: noun) matches, so the process proceeds to step s19, where “Restaurant” (part of speech: noun) is added to the word chain candidate sequence and the target character position is advanced by 10 characters. The process then moves from step s21 to step s12.

In step s12, a dictionary search is performed and “ ” (part of speech: blank) is found as a matching word, so the process moves from step s13 to step s19, where “ ” (part of speech: blank) is added to the word chain candidate sequence and the target character position is advanced by one character.

In step s14, the first character of the target character string is “T”, which is an alphabetic character, so the process proceeds to step s15.

In step s15, the alphabetic string “TOKYO”, which runs on from “T”, is extracted, and the process proceeds to step s16.

In step s16, “T” is an uppercase letter, so the process proceeds to step s17.

In step s17, “TOKYO” is entirely uppercase, so lowercasing except the first character is applied first and a dictionary search is performed on the string “Tokyo”. Because “Tokyo” (part of speech: noun) matches, full lowercasing to “tokyo” is not attempted. The process moves from step s18 to step s19, where “TOKYO” (part of speech: noun) is added to the word chain candidate sequence and the target character position is advanced past the matched string. The process then moves from step s21 to step s12.

The subsequent processing is the same as in a conventional dictionary search and connection check algorithm, so its description is omitted.

As a result, the word chain candidate sequence 41 shown in FIG. 5 is generated.

Next, in step s6 the word selection means 7 outputs the word information sequence 42 shown in FIG. 5 through the same word selection processing as in the conventional case.

<Processing Example 2>
FIG. 6 shows an example in which the processing described with FIG. 4 is used as the word chain generation processing. The representative steps are described below.

First, the processing of steps s11 to s21 by the word chain generation means 6 generates the word chain candidate sequence (after step s21) 51 shown in FIG. 6, and the process proceeds to step s31.

In step s31, the unprocessed unknown word “Restaurant”, an alphabetic string containing an uppercase letter, exists, so the process proceeds to step s32.

In step s32, only the first character of “Restaurant” is uppercase, so a dictionary search is performed on the fully lowercased string “restaurant”, and the process proceeds to step s33.

In step s33, “restaurant” (part of speech: noun) exists and can be connected to the preceding “「” (part of speech: bracket) and the following “ ” (part of speech: blank), so the process proceeds to step s34.

In step s34, the unknown word “Restaurant” is replaced with the dictionary search word “restaurant” (part of speech: noun), with only the notation set to “Restaurant”, and the process returns to step s31.

In step s31, the unprocessed unknown word “TOKYO”, an alphabetic string containing uppercase letters, exists, so the process proceeds to step s32.

In step s32, “TOKYO” is entirely uppercase, so lowercasing except the first character is applied, a dictionary search is performed on the string “Tokyo”, and the process proceeds to step s33.

In step s33, “Tokyo” (part of speech: noun) exists and can be connected to the preceding “ ” (part of speech: blank) and the following “」” (part of speech: bracket), so the process proceeds to step s34.

In step s34, the unknown word “TOKYO” is replaced with the dictionary search word “Tokyo” (part of speech: noun), with only the notation set to “TOKYO”, and the process returns to step s31.

In step s31, no unprocessed unknown word containing uppercase alphabetic characters remains, so the processing ends.

As a result, the word chain candidate sequence (after the end of the flow of FIG. 3) 52 shown in FIG. 6 is generated.

Next, in step s6 the word selection means 7 outputs the word information sequence 53 shown in FIG. 6 through the same word selection processing as in the conventional case.

Note that the terms “first storage means” and “second storage means” in the present invention express a functional distinction based on what data each stores; they do not mean that two separate storage devices are required in hardware. In addition, although the embodiment shows an example in which the word chain generation means and the word selection means are implemented as programs on a central processing unit (CPU), it goes without saying that each may instead be implemented in hardware.

FIG. 1: Block diagram showing a morphological analysis device according to an embodiment of the present invention.
FIG. 2: Flowchart corresponding to the program of the morphological analysis device according to the embodiment of the present invention.
FIG. 3: Flowchart showing an example of the detailed content of the word chain generation processing in FIG. 2.
FIG. 4: Flowchart showing another example of the detailed content of the word chain generation processing in FIG. 2.
FIG. 5: Explanatory diagram showing an example of concrete morphological analysis processing according to the present invention.
FIG. 6: Explanatory diagram showing another example of concrete morphological analysis processing according to the present invention.

Explanation of symbols

1: first storage means; 2: second storage means; 3: word dictionary; 4: grammar rules; 5: central processing unit (CPU); 6: word chain generation means; 7: word selection means.

Claims

A morphological analysis device that divides an input Japanese sentence into words and outputs a word information string in which information such as a reading and a part of speech is assigned to each word, the device comprising:
first storage means for storing the input Japanese sentence;
word chain generation means for reading the Japanese sentence from the first storage means, extracting from the Japanese sentence words that satisfy a connectable part-of-speech chain by using a word dictionary and grammar rules, wherein, for an unknown character string that consists of alphabetic characters including uppercase letters and whose notation is not registered in the word dictionary, words are extracted by searching the word dictionary again after converting the uppercase letters in the character string to lowercase, and generating a word chain candidate string, which is a candidate word chain in which the extracted words, together with information such as their readings and parts of speech, are combined so as to cover the Japanese sentence from its beginning to its end;
second storage means for storing the generated word chain candidate string; and
word selection means for reading the word chain candidate string from the second storage means, performing a predetermined cost calculation on the word chain candidate string, and selecting and outputting at least one word information string consisting of a series of word chains that covers the Japanese sentence from its beginning to its end.

The morphological analysis device according to claim 1, comprising word chain generation means that reads the Japanese sentence from the first storage means, extracts from the Japanese sentence words that satisfy a connectable part-of-speech chain by using the word dictionary and grammar rules, generates a word chain candidate string, which is a candidate word chain in which the extracted words, together with information such as their readings and parts of speech, are combined so as to cover the Japanese sentence from its beginning to its end, and, when the word chain candidate string contains an unknown character string that consists of alphabetic characters including uppercase letters and whose notation is not registered in the word dictionary, searches the word dictionary again after converting the uppercase letters in the character string to lowercase and, when a matching word exists, replaces the unknown alphabetic character string with that word.

A morphological analysis method for dividing an input Japanese sentence into words and outputting a word information string in which information such as a reading and a part of speech is assigned to each word,
the method using a computer comprising a word dictionary, grammar rules, first storage means, and second storage means,
wherein the computer performs:
a step of storing the input Japanese sentence in the first storage means;
a step of reading the Japanese sentence from the first storage means, extracting from the Japanese sentence words that satisfy a connectable part-of-speech chain by using the word dictionary and grammar rules, wherein, for an unknown character string that consists of alphabetic characters including uppercase letters and whose notation is not registered in the word dictionary, words are extracted by searching the word dictionary again after converting the uppercase letters in the character string to lowercase, and generating a word chain candidate string, which is a candidate word chain in which the extracted words, together with information such as their readings and parts of speech, are combined so as to cover the Japanese sentence from its beginning to its end;
a step of storing the generated word chain candidate string in the second storage means; and
a step of reading the word chain candidate string from the second storage means, performing a predetermined cost calculation on the word chain candidate string, and selecting and outputting at least one word information string consisting of a series of word chains that covers the Japanese sentence from its beginning to its end.

The morphological analysis method according to claim 3, comprising a step of reading the Japanese sentence from the first storage means, extracting from the Japanese sentence words that satisfy a connectable part-of-speech chain by using the word dictionary and grammar rules, generating a word chain candidate string, which is a candidate word chain in which the extracted words, together with information such as their readings and parts of speech, are combined so as to cover the Japanese sentence from its beginning to its end, and, when the word chain candidate string contains an unknown character string that consists of alphabetic characters including uppercase letters and whose notation is not registered in the word dictionary, searching the word dictionary again after converting the uppercase letters in the character string to lowercase and, when a matching word exists, replacing the unknown alphabetic character string with that word.