JP3933406B2 - Pronoun rewriting device and method, and program used therefor

Info

Publication number: JP3933406B2
Application number: JP2001065009A
Authority: JP
Inventors: 毅彦吉見
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2001-03-08
Filing date: 2001-03-08
Publication date: 2007-06-20
Anticipated expiration: 2021-03-08
Also published as: JP2002269086A

Description

【０００１】
【発明の属する技術分野】
本発明は代名詞書換装置に関し、詳しくは、機械翻訳装置による翻訳において、第１言語国と第２言語国の言語習慣(言語的特徴)の相違に起因する不自然な翻訳文の生成を抑える代名詞書換装置及び方法並びにこれに利用されるプログラムに関する。
【０００２】
【従来の技術】
従来の機械翻訳装置を利用して英語を日本語に翻訳した場合に生じる問題点の一つとして、代名詞に関する問題がある。
具体的には、英語の代名詞をそのまま日本語の代名詞として訳出すると、英語文が伝えている意味と異なる意味を伝える日本語文が生成されたり、あるいは、文意は同じでも不自然で読みにくい日本語文が生成されてしまうという問題がある。
例えば、次の文（E1）を従来の機械翻訳装置で処理すると、文（J1）のように訳される。
【０００３】
（E1）By the time the average American reaches the age of 70, he consumes 13 tons of beef.
（J1）平均的アメリカ人は70歳になるまでに、彼は１３トンの牛肉を消費している。
【０００４】
文（E1）において、“the average American”は総称的な意味で用いられている（特定の人物を指しているのではない）。英語では、先行名詞が総称的な意味を持つ場合でも、代名詞はその先行名詞を指すことができる。
これに対して、文献「日英語代名詞の研究」（神崎高明、研究社出版、1994）によれば、日本語では、代名詞は、先行名詞が総称的な意味を持つ場合、その先行名詞を指すことができない。
【０００５】
従って、文（E1）では“he”は“the average American”（と同一人物）を指していると解釈されるが、文（J1）では「彼」が「平均的アメリカ人」（と同一人物）を指しているとは解釈できない。このため、文（J1）は文（E1）の意味を正しく伝えていない翻訳である。
【０００６】
また、次の文（E2）では“she”は“Mary”を指している。
これに対して、文献「日英語代名詞の研究」（神崎高明、研究社出版、1994）によれば、日本語では後方照応（代名詞がそれより後方に現れる名詞を指すこと）は基本的に不可能であるため、文（J2）では「彼女」が「メアリー」を指しているようには解釈できない。
（E2） Mary scolded John before she left home.
（J2）彼女が家を出る前に、メアリーはジョンを叱った。
【０００７】
このような問題に対処するための従来技術としては、特開平6-203068号公報に記載の「機械翻訳装置」や文献「形態素を利用した自然な和文訳のためのゼロ代名詞生成規則」（宮東城、永田守男、電子情報通信学会、技術研究報告NLC2000-5,2000）などに示される技術がある。
これら従来技術は、英語の代名詞を常に日本語の代名詞として訳すのではなく、不要な場合には訳さないことによって、上記の問題に対処しようとするものである。
【０００８】
【発明が解決しようとする課題】
しかしながら、上記の従来技術には次の二つの問題点がある。
前者の従来技術は、代名詞を訳す（残す）か、訳さない（削除する）かの二値の判定で行われている。
この前者の従来技術による翻訳処理では、文（J2）のような場合に適切に対処できないことがある。
【０００９】
文（J2）から「彼女」を削除することは必ずしも適切ではない。削除すると「家を出る」の主語が「メアリー」であるのか「ジョン」であるのかが曖昧になってしまうからである。
文（E2）の文意と同じ文意を伝え、かつ、主語の曖昧性が生じるのを抑えるためには、「彼女」を、「出る」の主語が「メアリー」であると特定できる表現、例えば、「自分」などに置き換える必要がある。
【００１０】
後者の従来技術は、代名詞を訳すか、訳さないかの判定を行う規則が人手で記述されている。代名詞を訳すか、訳さないかは様々な要因によって決まるため、複雑に関連し合う要因を人手で整理し、その結果に基づいて一貫性のある（矛盾のない）規則を記述し、さらに維持管理するには非常に大きな労力を要するという問題がある。
【００１１】
本発明は以上の事情を考慮してなされたものであり、例えば、予め備えた事例文テーブルに記憶された代名詞書換事例文から木形式で表記した形態素・語彙属性からなる代名詞書換規則を作成することにより、自然言語文の機械翻訳の後処理において、より自然な代名詞の書き換えができる代名詞書換装置を提供する。
【００１２】
【課題を解決するための手段】
本発明は、自然言語文を入力する入力部と、辞書テーブル、形態素解析規則テーブル、属性テーブル及び事例文テーブルを記憶したテーブルメモリと、辞書テーブル及び形態素解析規則テーブルを参照し、入力された自然言語文の形態素解析を行う形態素解析部と、形態素解析結果から属性テーブルに定義された形態素・語彙属性を抽出する属性抽出部と、事例文テーブルに記憶された代名詞書換事例文から木形式で表記した形態素・語彙属性からなる代名詞書換規則の決定木を作成する決定木作成部と、決定木を参照し、自然言語文に含まれる各代名詞に対応する書換規則を前記抽出された形態素・語彙属性に基づいて決定する書換規則決定部と、決定した書換規則に基づいて対象の代名詞を書き換える代名詞書換部とを備えたことを特徴とする代名詞書換装置である。
【００１３】
本発明によれば、予め備えた事例文テーブルに記憶された代名詞書換事例文から木形式で表記した形態素・語彙属性からなる代名詞書換規則を作成することにより、自然言語文の機械翻訳の後処理において、より自然な代名詞の書き換えができる。
これにより、特定の機械翻訳装置からの独立性が高く、各種機械翻訳装置の後編集に汎用的に利用することができる。
【００１４】
前記決定木作成部は、前記事例文テーブルの代名詞書換事例文を参照し、統計的機械学習法の一つである決定木学習法によって代名詞書換規則の決定木を作成する構成にしてもよい。
この構成によれば、代名詞書換規則の決定木を統計的機械学習法の一つである決定木学習法によって作成することができるので、代名詞書換規則の決定木の作成・維持管理が簡単になる。
【００１５】
前記事例文テーブルは、自然言語文中の各代名詞に対して、代名詞を文中から削除する処理を示すラベル、代名詞を残す処理を示すラベル、あるいは代名詞を他の代名詞に置き換える処理を示すラベルのいずれかを付与した代名詞書換事例文を記憶した構成にしてもよい。
この構成によれば、代名詞を消すか残すかの二値の判定だけではなく、他の表現に置き換える多値の判定も可能になるので、より自然な代名詞の書き換えができる。
例えば、従来技術では行えなかった代名詞を他の代名詞に置き換える（例えば、文（J2）に含まれる「彼女」の再帰代名詞「自分」への置換）ようにすることができ、より品質の高い訳文を生成することができる。
【００１６】
本発明の別の観点によれば、自然言語文で記述された原言語文を入力する入力部と、原言語解析規則テーブル、原言語・目的言語変換規則テーブル、属性テーブル及び事例文テーブルを記憶したテーブルメモリと、原言語解析規則テーブルに基づいて原言語文を解析し、解析結果から得られた原言語文の構文木を原言語・目的言語変換規則テーブルに基づいて目的言語の構文木に変換する解析・変換部と、目的言語の構文木から属性テーブルに定義されている形態素・語彙属性及び構文属性を抽出する属性抽出部と、事例文テーブルに記憶された代名詞書換事例文から木形式で表記した形態素・語彙属性及び構文属性からなる代名詞書換規則の決定木を作成する決定木作成部と、決定木を参照し、目的言語文に含まれる各代名詞に対応する書換規則を前記構文木から抽出された形態素・語彙属性及び構文属性に基づいて決定する書換規則決定部と、決定した書換規則に基づいて対象の代名詞を書き換える代名詞書換部とを備えたことを特徴とする機械翻訳装置が提供される。
本発明によれば、機械翻訳において、予め備えた事例文テーブルに記憶された代名詞書換事例文から形態素・語彙属性だけでなく構文属性も参照して木形式の代名詞書換規則を作成するので、より精度の高い代名詞の書き換えが可能になり、より自然な翻訳が可能になる。
【００１７】
【発明の実施の形態】
以下、図に示す実施例に基づいて本発明を詳述する。なお、本発明はこれによって限定されるものではない。
【００１８】
図１は本発明の実施例１である代名詞書換装置の構成を示すブロック図である。図１に示すように、代名詞書換装置は、制御部１、入力部２、出力部３、テーブルメモリ４、プログラムメモリ５、バッファメモリ６、バス７、記憶媒体８から構成される。
【００１９】
制御部１は、例えば、ＣＰＵで構成され、プログラムメモリ５から制御プログラムを読み出し、この制御プログラムに従って、バス７を介して各部を制御することによって本発明の代名詞書き換え機能を実現する。
【００２０】
入力部２は、例えば、キーボード、タブレット、データを受信する通信インタフェースなどで構成され、自然言語で記述された文の入力を行う。
出力部３は、例えば、液晶ディスプレイ、プラズマディスプレイ、ＥＬディスプレイなどからなる表示装置、熱転写プリンタ、レーザプリンタなどからなる印字装置、データを送信する通信インタフェースなどで構成され、プログラムメモリ５に記憶されている制御プログラムによって処理された結果を出力する。
【００２１】
テーブルメモリ４は、例えば、ＲＯＭ、ＥＥＰＲＯＭ、フロッピーディスク、ハードディスク、ＣＤ−Ｒ／Ｗなどで構成され、形態素・語彙情報を記憶した辞書テーブル４ａ、自然言語文（訳文）を形態素解析するための規則を記憶した形態素解析規則テーブル４ｂ、訳文から抽出すべき形態素・語彙属性を定義した属性テーブル４ｃ、訳文中の代名詞に対する正しい書き換えを示すラベルが付与された代名詞書換事例文の集合（正解付きコーパス）を記憶した事例文テーブル（正解付きコーパステーブル）４ｄとして機能する領域を備えている。
【００２２】
プログラムメモリ５には、制御部１が、辞書テーブル４ａ及び形態素解析規則テーブル４ｂを参照し、入力部２により入力された訳文の形態素解析を行う形態素解析部５ａ、形態素解析結果から属性テーブル４ｃに定義された形態素・語彙属性を抽出する属性抽出部５ｂ、事例文テーブル４ｄに記憶された代名詞書換事例文から木形式で表記した形態素・語彙属性からなる代名詞書換規則の決定木を作成する決定木作成部５ｃ、決定木を参照し、自然言語文に含まれる各代名詞に対応する書換規則を前記抽出された形態素・語彙属性に基づいて決定する書換規則決定部５ｄ、決定した書換規則に基づいて対象の代名詞を書き換える代名詞書換部５ｅとして機能するプログラムが記憶されている。
【００２３】
バッファメモリ６は、例えば、ＲＡＭ、ＥＥＰＲＯＭ、フロッピーディスク、ハードディスク、ＣＤ−Ｒ／Ｗなどで構成され、入力部２によって入力された文を記憶する原文バッファ６ａ、形態素解析部５ａによって得られた形態素解析結果を記憶する形態素解析結果バッファ６ｂ、属性抽出部５ｂによって抽出された形態素・語彙属性を記憶する属性バッファ６ｃ、属性抽出部５ｂによって文Ｓから抽出された属性と正解付きコーパスの各文Ｓに付与されている代名詞に対する正しい書き換えを示すラベルとの組を記憶する事例データバッファ６ｄ、決定木作成部５ｃによって得られた代名詞書換規則の決定木を記憶する決定木バッファ６ｅ、書換規則決定部５ｄによって決定された代名詞に対する書換規則の決定結果を記憶する書換規則決定結果バッファ６ｆ、代名詞書換部５ｅによって書き換えられた自然言語文を記憶する代名詞書換結果バッファ６ｇとして機能する領域を備えている。
【００２４】
プログラムメモリ５に記憶されているプログラム群は、１）代名詞書換事例文（正解付きコーパス）に基づいて代名詞に対する書換規則を自動的に作成するプログラム群と、２）作成された代名詞書換規則を入力文に適用して入力文の代名詞を書き換えるプログラム群に大別される。
本発明では、決定木と呼ばれる表現形式で代名詞書換規則を記述する。書換規則作成用プログラムは、決定木作成部として機能するプログラムであり、代名詞書換用プログラムは、書換規則決定部として機能するプログラムと、代名詞書換部として機能するプログラムである。
形態素解析部として機能するプログラムと属性抽出部として機能するプログラムは、代名詞書換規則作成時と代名詞書換時の両方で用いられる。
【００２５】
［代名詞書換規則の自動作成］
形態素解析部５ａは、辞書テーブル４ａと形態素解析規則テーブル４ｂに基づいて、事例文テーブル４ｄに格納されている各文に対して形態素解析を行い、結果を形態素解析結果バッファ６ｂに格納する。
また、原文バッファ６ａに格納されている入力文に対しても同様に形態素解析を行う。
形態素解析は、非常によく知られている一般的な技術であるので説明は略す。形態素解析については、例えば、文献「自然言語処理」（長尾眞、岩波書店、1997）などに解説がある。
【００２６】
属性抽出部５ｂは、属性テーブル４ｃに定義されている形態素・語彙属性を形態素解析結果から抽出し、属性バッファ６ｃあるいは事例データバッファ６ｄに格納する。
属性は、正解付きコーパス中の文や入力文が持つ様々な性質である。
事例データは、属性とクラスによって表現される。
クラスは、正解付きコーパス中の文に付与されているラベルに対応する。代名詞書換事例文（正解付きコーパス）は、機械翻訳装置から出力された訳文に含まれる代名詞に対して、その代名詞を削除すべきか、そのままでよいか、あるいは他の表現に置き換えるべきかなどを示すラベルを人手で付与した文の集合である。
【００２７】
例えば、文（J1）と文（J2）に対しては、それぞれ次のようにラベルを付加することができる。
（J1'）平均的アメリカ人は70歳になるまでに、＃Ｄ彼は１３トンの牛肉を消費している。
（J2'）＃Ｊ彼女が家を出る前に、メアリーはジョンを叱った。
【００２８】
ここで、「彼は」に付与されているラベル“＃Ｄ"は「彼は」を削除すべきことを意味し、「彼女が」に付与されているラベル“＃Ｊ"は「彼女が」を「自分が」に置き換えるべきことを意味する。
この例では現れていないが、削除する必要のない代名詞にはラベル“＃Ｎ”が付与されるものとする。なお、ラベルの種類はこれら三種類に限定されるものではない。
【００２９】
図２は本実施例の事例データバッファに記憶される事例データの一例を示す図である。例えば、代名詞書換事例文（正解付きコーパス）中の各文において、今着目している代名詞「彼」には削除すべきことを意味する“＃Ｄ”が付与されているので、文（J1）から抽出された属性の組「彼」、「は」、「有」に対して、「消」というクラスが対応する。このような処理を施していくと、図２に示す事例データが得られ、事例データバッファ６ｄに記憶される。
【００３０】
図２に示す事例データにおいて、クラスは代名詞を「消」すか「残」すかの二値となっているが、正解付きコーパス中の代名詞に付与されているラベルが多値ならば、それが自動的に事例データ集に反映されるので、従来技術の問題は、正解付きコーパス作成時に適切なラベルを設定することによって解決される。
【００３１】
決定木作成部５ｃは、事例データバッファ６ｄに記憶されている事例データを分類することによって、決定木の形式で代名詞書換規則を帰納的に作成し、決定木バッファ６ｅに記憶する。
決定木は、クラスを表わす終端節点と、一つの属性を調べるテストに対応する非終端節点（判別節点）とからなる。
【００３２】
書換規則決定部５ｄは、属性抽出部５ｂによって入力文から抽出された属性に基づいて決定木を根節点から終端節点に向けて判別節点でのテストの結果に従いながら検索し、入力文に含まれる代名詞に対する書換規則（削除（＃Ｄ）、他の表現への置換（＃Ｊ）、そのまま保持（＃Ｎ）など）を決定し、その結果を書換規則決定結果バッファ６ｆに格納する。
【００３３】
代名詞書換部５ｅは、書換規則決定部５ｄで下された決定に従って、代名詞の書き換え処理を行い、その結果を代名詞書換結果バッファ６ｇに格納する。
属性テーブル４ｃには、文から抽出すべき属性があらかじめ定義されている。例えば、「代名詞の表記」、「代名詞に結合している付属語」、「代名詞が指しうる先行（代）名詞の有無」の三種類が抽出すべき属性であると定められているものとする。
【００３４】
このとき、文（J1）の形態素解析結果に対して属性抽出部５ｂが属性テーブル４ｃに従って処理を行うと、「彼」、「は」、「有」が抽出される。ここで、「有」は「平均的アメリカ人」の存在を意味する。
また、決定木作成部５ｃでは、文献「ＡＩによるデータ解析」（J.R.Quinlan著、古川康一監訳、トッパン、1995）に示されるＣ４.５と呼ばれる方法に従って事例データから決定木を作成する。
【００３５】
Ｃ４.５による決定木の作成は、事例集合Ｔをｎ個の部分集合に分割するテストＸを利得基準に従って順次選択していくことによって行われる。
利得基準とは、次式で表わされるinfo(Ｔ)とinfo_X(Ｔ)との差（利得）
gain(Ｘ)＝info(Ｔ)−info_X(Ｔ)
が最大になるようなテストＸを選ぶ基準である。
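【００３６】
$$\mathrm{info}(T) = -\sum_{j} \frac{\mathrm{freq}(C_j, T)}{|T|} \times \log_2 \frac{\mathrm{freq}(C_j, T)}{|T|}$$
$$\mathrm{info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \mathrm{info}(T_i)$$
ここで、freq(Ｃ_j,Ｔ)は事例集合Ｔの中でクラスＣ_jに属する事例の数を意味し、|Ｔ|は事例集合Ｔに含まれる全事例数を意味する。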
【００３７】
図３は本実施例の代名詞書換規則の決定木作成手順を示すフローチャートである。Ｃ４.５に従って決定木を作成する。
ステップＳ１０１で、すべての事例を根節点に割り当てる。
ステップＳ１０３で、事例集合Ｔに対してgain(Ｘ)を最大にするテストＸを選択する。
【００３８】
図２に示す事例データが存在するとき、gain(Ｘ)は、次のように計算される。図２の事例データには、クラス「消」の９事例データ、クラス「残」の５事例データが存在するので、
$$\mathrm{info}(T) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$
である。
【００３９】
ステップＳ１０４で、そのテストＸで事例集合を部分集合に分割し、各部分集合を節点として決定木を成長させる。
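属性「表記」の値に従って事例集合を三つに分割したとき、
$$\mathrm{info}_{\text{表記}}(T) = \sum_{i=1}^{3} \frac{|T_i|}{|T|} \times \mathrm{info}(T_i) = 0.694$$
となる。従って、属性「表記」に基づくテストによる分割で得られる利得は、
gain（表記）＝info(Ｔ)−info_表記(Ｔ)＝0.940−0.694＝0.246となる。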
【００４０】
属性「付属語」の値や属性「先行（代）名詞」の値に従って事例データ集を分割する場合の利得を同様に計算すると、属性「表記」の場合より大きな利得は得られない。従って、属性「表記」についてのテストが根節点で行うテストとして選ばれる。
【００４１】
ステップＳ１０２で、終了条件が満たされるまで、同様に処理を進めていけば、最終的な決定木が作成される。
【００４２】
図４は本実施例の代名詞書換規則の決定木作成結果の一例を示す図である。図４に示す代名詞書換規則の決定木は、図２に示す事例データ集から作成され、決定木バッファ６ｅに記憶される。決定木の非終端節点にはテストに相当する属性が、終端節点にはクラス名が記述されており、枝には属性値が付与されている。
【００４３】
［代名詞書換規則の適用］
上記の方法で事例データ集から自動的に作成された代名詞書換規則の決定木を入力文に適用して代名詞を書き換える処理は、形態素解析部５ａ、属性抽出部５ｂ、書換規則決定部５ｄ、代名詞書換部５ｅによって実行される。
形態素解析部５ａと属性抽出部５ｂは、正解付きコーパス中の各文に対して行った処理と同じ処理を入力文に対して行う。これによって、入力文から属性が抽出され、属性バッファ６ｃに格納される。
例えば、次の文（J3）が入力文であるとする。
【００４４】
（E3）She went home because she didn't feel well.
（J3）彼女が気分が良くなかったので、彼女は帰宅した。
文（J3）には、代名詞「彼女」が二回出現しているので、
代名詞の表記＝彼女
代名詞に結合している付属語＝が
代名詞が指しうる先行（代）名詞の有無＝無
という属性値と、
代名詞の表記＝彼女
代名詞に結合している付属語＝は
代名詞が指しうる先行（代）名詞の有無＝有
という属性値が得られる。
【００４５】
書換規則決定部５ｄは、属性抽出部５bによって入力文から抽出された属性に基づいて決定木を根節点から終端節点に向けて判別節点でのテストの結果に従いながら辿っていく。
【００４６】
図５は本実施例の書換規則決定手順を示すフローチャートである。
文（J3）から抽出された上記の最初の属性値の場合、図４の決定木を辿る過程は次のようになる。
まず、ステップＳ２０１で、根節点を着目節点とする。
現在の着目節点は終端節点ではないので、ステップＳ２０２を経てステップＳ２０３に行く。
ステップＳ２０３で、着目節点での属性（テスト）は「表記」であり、入力文の属性「表記」の値は「彼女」であるので、「彼女」が付与されている枝を辿り「先行（代）名詞」の節点を次の着目節点とする。
【００４７】
ステップＳ２０２を経てステップＳ２０３に戻り、現在の着目節点で「先行（代）名詞」のテストを行うと、入力文の属性「先行（代）名詞」の値は「無」であるので、「無」が付与されている枝を辿り、終端節点「消」を次の着目節点とする。
ステップＳ２０２で、現在の着目節点は終端節点であるので、走査を終了し、到達した終端節点に記述されているクラス名「消」を対象代名詞と共に書換規則決定結果バッファ６ｆに格納する。
【００４８】
二つ目の属性値の場合も同様に決定木を辿り、クラス名「残」を対象代名詞と共に書換規則決定結果バッファ６ｆに格納する。
代名詞書換部５ｅは、書換規則決定結果バッファ6fに記憶されている対象代名詞とそれに対するクラス名すなわち書き換えの種類に応じて、対象代名詞を書き換える。
以上の処理によって、文（J3）から代名詞を含む文節「彼女が」が削除され、次のような文が代名詞書換結果バッファ６ｇに記憶される。
（J3'）気分が良くなかったので、彼女は帰宅した。
【００４９】
図６は本発明の実施例２である機械翻訳装置の構成を示すブロック図である。実施例２の機械翻訳装置は、制御部１、入力部２、出力部３、テーブルメモリ４、プログラムメモリ５、バッファメモリ６、バス７、記憶媒体８から構成される。
制御部１、入力部２、出力部３は、実施例１の代名詞書換装置におけるそれぞれの各部と同じであるので、説明を省略する。
【００５０】
テーブルメモリ４は、原言語の文を解析する規則を記憶した原言語解析規則テーブル４ａ'、原言語文の構文木を目的言語の構文木へ変換するための規則を記憶した原言語・目的言語変換規則テーブル４ｂ'、構文木から抽出すべき形態素レベルの属性、構文属性を定義した属性テーブル４ｃ、文中の代名詞に対する正しい書き換えを示すラベルが付与された文の集合（正解付きコーパス）を記憶した事例文テーブル（正解付きコーパステーブル）４ｄとして機能する領域を備えている。
【００５１】
プログラムメモリ５には、解析・変換部５ａ'、属性抽出部５ｂ、決定木作成部５ｃ、書換規則決定部５ｄ、代名詞書換部５ｅとして機能するプログラムが記憶されている。
【００５２】
解析・変換部５ａ'は、原文バッファ６ａに格納されている文に対して、原言語解析規則テーブル４ａ'に基づいて解析し、原言語・目的言語変換規則テーブル４ｂ'に基づいて原言語を目的言語に変換する。
属性抽出部５ｂは、属性テーブル４ｃに定義されている構文属性を構文木から抽出する。
決定木作成部５ｃ、書換規則決定部５ｄは、実施例１の装置における決定木作成部５ｃ、書換規則決定部５ｄと同じである。
代名詞書換部５ｅは、実施例１の装置における代名詞書換部の機能を含むものである。
【００５３】
バッファメモリ６は、原文バッファ６ａ、構文木バッファ６ｂ'、属性バッファ６ｃ、事例データバッファ６ｄ、決定木バッファ６ｅ、書換規則決定結果バッファ６ｆ、代名詞書換結果バッファ６ｇとして機能する領域を備えている。
【００５４】
原文バッファ６ａ、属性バッファ６ｃ、事例データバッファ６ｄ、決定木バッファ６ｅ、書換規則決定結果バッファ６ｆは、実施例１の装置におけるそれぞれと同じである。
構文木バッファ６ｂ'は、解析・変換部５ａ'によって解析・変換された構文木を記憶する。
代名詞書換結果バッファ６ｇは、代名詞書換部５ｅによって作成された文を記憶する。
【００５５】
実施例１の代名詞書換装置と実施例２の機械翻訳装置との主要な差異は、属性抽出部が抽出できる属性の性質である。
すなわち、実施例１における属性抽出部が抽出する属性は、形態素・語彙レベルのものであるのに対して、実施例２における属性抽出部が抽出する属性は、形態素・語彙レベルと構文レベルのものである。
【００５６】
実施例２における属性抽出部が抽出する属性としては、例えば、「代名詞の係り先の語句に関する情報」（係り先の語句の表記や品詞や意味コード）や、「代名詞が含まれる節に関する情報」（主節か従属節か）などがある。
【００５７】
例えば、文（J1）からは次のような構文レベルの属性が抽出できる。
代名詞の係り先の表記＝消費する
代名詞の係り先の品詞＝動詞
代名詞の係り先の意味コード＝ 1371007
代名詞が含まれる節＝主節
【００５８】
また、文（J2）から抽出される構文レベルの属性は次のようになる。
代名詞の係り先の表記＝出る
代名詞の係り先の品詞＝動詞
代名詞の係り先の意味コード＝ 2121001
代名詞が含まれる節＝従属節
【００５９】
代名詞をどのように書き換えるべきかを決定する要因は多岐にわたる。このように代名詞書換の精度を向上させるためには、できるだけ多くの種類の属性を参照する必要がある。この点で、実施例２の装置では、形態素・語彙レベルの属性だけでなく、構文レベルの属性も参照しているので、より精度の高い代名詞の書き換えが実現できる。
【００６０】
図７は実施例１の代名詞書換装置と実施例２の機械翻訳装置の関係を示す図である。図７に示すように、実施例１の代名詞書換装置は、機械翻訳において、原言語と目的言語の言語習慣の相違に起因する不自然な目的言語文の作成を抑えることを目的とするものであり、特定の機械翻訳装置の中に組み込んだものではなく、機械翻訳の後編集に適用することができる。
【００６１】
従って、実施例１は、機械翻訳後の後編集としての代名詞書換装置であり、機械翻訳装置からの独立性が高いという利点があるが、この独立性を保つために、参照可能な属性を形態素・語彙レベルのものに限定しているので、実施例２の機械翻訳装置による代名詞書換の精度より高くないと予想される。
【００６２】
実施例２の機械翻訳装置は、代名詞書換処理を機械翻訳処理の一機能として機械翻訳装置の中に組み込んだものであるが、さらに構文属性まで参照しているので、実施例２の装置による代名詞書換の精度は、実施例１の装置よりも高くなると期待できる。
【００６３】
以上説明した代名詞書換装置あるいは機械翻訳装置は、コンピュータ処理を機能させるためのプログラムで実現される。
発明の対象とするのは、このプログラムそのものであってもよいし、このプログラムをコンピュータで読み取り可能な記憶媒体８に格納されているプログラムであってもよい。
まず、本発明では、この記憶媒体８として、図１または図６に示されている制御部１のマイクロコンピュータで処理が行われるために必要なメモリ、例えば、ＲＯＭのようなものそのものがプログラムメディアであってもよいし、また、図示していないが外部記憶装置としてプログラム読み取り装置が設けられ、そこに記憶媒体を挿入することで読み取り可能なプログラムメディアであってもよい。いずれかの場合においても、格納されているプログラムはマイクロコンピュータがアクセスして実行させる構成であってもよいし、あるいはいずれの場合もプログラムを読み出し、読み出されたプログラムは、マイクロコンピュータの図示されていないプログラム記憶エリアにロードされて、そのプログラムが実行される方式であってもよい。このロード用のプログラムはあらかじめ本体装置に格納されているものとする。
【００６４】
ここで上記プログラムメディアは、本体と分離可能に構成される記憶媒体であり、磁気テープやカセットテープ等のテープ系、フロッピーディスクやハードディスク等の磁気ディスクやＣＤ−ＲＯＭ／ＭＯ／ＭＤ／ＤＶＤ等の光ディスクのディスク系、ＩＣカード（メモリカードを含む）／光カード等のカード系、あるいはマスクＲＯＭ、ＥＰＲＯＭ、ＥＥＰＲＯＭ、フラッシュＲＯＭ等による半導体メモリを含めた固定的にプログラムを担持する媒体であってもよい。
また、本発明においては、インターネットを含む通信ネットワークと接続可能なシステム構成であることから、通信ネットワークからプログラムをダウンロードするように流動的プログラムを担持する媒体であってもよい。尚、このように通信ネットワークからプログラムをダウンロードする場合には、そのダウンロード用プログラムは予め装置本体に格納しておくか、あるいは別な記憶媒体からインストールされるものであってもよい。
尚、記憶媒体に格納されている内容としてはプログラムに限定されず、データであってもよい。
【００６５】
次に、本発明では、プログラム自体として、図１に示されている制御部１のマイクロコンピュータで実行される処理そのものであってもよいし、あるいは、インターネットを含む通信ネットワークとアクセスすることで取り込める。あるいは、取り込めたものであってもよいし、こちらから送り出すものであってもよい。さらには、この取り込んだプログラムに基づいて上記代名詞書換装置あるいは機械翻訳装置内で処理された結果、つまり生成されたものであってもよい。あるいは、こちらから送り出す際に上記代名詞書換装置あるいは機械翻訳装置内で処理された結果、つまり生成されたものであってもよい。
尚、これらのものはプログラムに限定されず、データであってもよい。
【００６６】
【発明の効果】
本発明によれば、予め備えた事例文テーブルに記憶された代名詞書換事例文から木形式で表記した形態素・語彙属性からなる代名詞書換規則を作成することにより、自然言語文の機械翻訳の後処理において、より自然な代名詞の書き換えができる。
これにより、特定の機械翻訳装置からの独立性が高く、各種機械翻訳装置の後編集に汎用的に利用することができる。
【図面の簡単な説明】
【図１】本発明の実施例１である代名詞書換装置の構成を示すブロック図である。
【図２】本実施例の事例データバッファに記憶される事例データの一例を示す図である。
【図３】本実施例の代名詞書換規則の決定木作成手順を示すフローチャートである。
【図４】本実施例の代名詞書換規則の決定木作成結果の一例を示す図である。
【図５】本実施例の書換規則決定手順を示すフローチャートである。
【図６】本発明の実施例２である機械翻訳装置の構成を示すブロック図である。
【図７】実施例１の代名詞書換装置と実施例２の機械翻訳装置の関係を示す図である。
【符号の説明】
１制御部
２入力部
３出力部
４テーブルメモリ
４ａ辞書テーブル
４ａ' 原言語解析規則テーブル
４ｂ形態素解析規則テーブル
４ｂ' 原言語・目的言語変換規則テーブル
４ｃ属性テーブル
４ｄ事例文テーブル（正解付きコーパステーブル）
５プログラムメモリ
５ａ形態素解析部
５ａ' 解析・変換部
５ｂ属性抽出部
５ｃ決定木作成部
５ｄ書換規則決定部
５ｅ代名詞書換部
６バッファメモリ
６ａ原文バッファ
６ｂ形態素解析結果バッファ
６ｂ' 構文木バッファ
６ｃ属性バッファ
６ｄ事例データバッファ
６ｅ決定木バッファ
６ｆ書換規則決定結果バッファ
６ｇ代名詞書換結果バッファ
７バス
８記憶媒体
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a pronoun rewriting device, and more particularly to a pronoun rewriting device and method, and a program used therefor, that suppress the generation of unnatural translations caused by differences in linguistic conventions (linguistic features) between a first-language country and a second-language country in translation by a machine translation device.
[0002]
[Prior art]
One of the problems that arise when English is translated into Japanese using a conventional machine translation device is related to pronouns.
Specifically, if an English pronoun is rendered directly as a Japanese pronoun, the resulting Japanese sentence may convey a meaning different from that of the English sentence, or may be unnatural and hard to read even when the meaning is preserved.
For example, when the following sentence (E1) is processed by a conventional machine translation device, it is translated as sentence (J1).
[0003]
(E1) By the time the average American reaches the age of 70, he consumes 13 tons of beef.
(J1) 平均的アメリカ人は70歳になるまでに、彼は１３トンの牛肉を消費している。 (Here the English pronoun “he” is rendered as the Japanese pronoun 「彼」.)
[0004]
In the sentence (E1), “the average American” is used generically (not referring to a specific person). In English, pronouns can refer to the preceding noun even if the preceding noun has a generic meaning.
On the other hand, according to the document “Studies on Japanese-English pronouns” (Takaaki Kanzaki, published by Kenkyusha, 1994), a Japanese pronoun cannot refer to a preceding noun that has a generic meaning.
[0005]
Therefore, in sentence (E1), “he” is interpreted as referring to (the same person as) “the average American”, but in sentence (J1), 「彼」 (“he”) cannot be interpreted as referring to (the same person as) 「平均的アメリカ人」 (“the average American”). For this reason, sentence (J1) is a translation that does not correctly convey the meaning of sentence (E1).
[0006]
In the following sentence (E2), “she” refers to “Mary”.
In contrast, according to the document “Studies on Japanese-English pronouns” (Takaaki Kanzaki, published by Kenkyusha, 1994), backward anaphora (a pronoun referring to a noun that appears after it) is basically impossible in Japanese, so in sentence (J2) 「彼女」 (“she”) cannot be interpreted as referring to 「メアリー」 (“Mary”).
(E2) Mary scolded John before she left home.
(J2) 彼女が家を出る前に、メアリーはジョンを叱った。
[0007]
Conventional techniques for dealing with such problems include the “machine translation device” described in Japanese Patent Application Laid-Open No. 6-203068 and the technique shown in the document “Zero Pronoun Generation Rules for Natural Japanese Translation Using Morphemes” (宮東城 and 永田守男, IEICE Technical Report NLC2000-5, 2000).
These conventional techniques address the above problems by not always rendering an English pronoun as a Japanese pronoun, and by leaving it untranslated when it is unnecessary.
[0008]
[Problems to be solved by the invention]
However, the above-described conventional technology has the following two problems.
The former conventional technique makes only a binary decision: whether to translate (retain) a pronoun or not to translate (delete) it.
Translation processing based on this technique may fail to handle cases such as sentence (J2) appropriately.
[0009]
Deleting 「彼女」 (“she”) from sentence (J2) is not necessarily appropriate, because deletion makes it ambiguous whether the subject of 「家を出る」 (“leave home”) is “Mary” or “John”.
To convey the same meaning as sentence (E2) while avoiding this ambiguity, 「彼女」 must be replaced with an expression that identifies “Mary” as the subject of 「出る」 (“leave”), for example the reflexive 「自分」 (“self”).
[0010]
In the latter conventional technique, the rules for deciding whether or not to translate a pronoun are written by hand. Because this decision depends on many factors, organizing the intricately interrelated factors by hand, writing consistent (non-contradictory) rules based on the result, and then maintaining those rules requires a very large amount of labor.
[0011]
The present invention has been made in view of the above circumstances. It provides a pronoun rewriting device that can rewrite pronouns more naturally in the post-processing of machine translation of natural language sentences, for example by creating pronoun rewriting rules, composed of morpheme and vocabulary attributes and expressed in tree form, from pronoun rewriting example sentences stored in a case sentence table prepared in advance.
[0012]
[Means for Solving the Problems]
The present invention is a pronoun rewriting device comprising: an input unit for inputting a natural language sentence; a table memory storing a dictionary table, a morphological analysis rule table, an attribute table, and a case sentence table; a morphological analysis unit that performs morphological analysis of the input natural language sentence with reference to the dictionary table and the morphological analysis rule table; an attribute extraction unit that extracts, from the morphological analysis result, the morpheme and vocabulary attributes defined in the attribute table; a decision tree creation unit that creates, from the pronoun rewriting example sentences stored in the case sentence table, a decision tree of pronoun rewriting rules composed of morpheme and vocabulary attributes expressed in tree form; a rewriting rule determination unit that refers to the decision tree and determines, based on the extracted morpheme and vocabulary attributes, the rewriting rule corresponding to each pronoun contained in the natural language sentence; and a pronoun rewriting unit that rewrites the target pronoun based on the determined rewriting rule.
[0013]
According to the present invention, pronoun rewriting rules composed of morpheme and vocabulary attributes expressed in tree form are created from pronoun rewriting example sentences stored in a case sentence table prepared in advance, so that pronouns can be rewritten more naturally in the post-processing of machine translation of natural language sentences.
As a result, the device is highly independent of any specific machine translation device and can be used generally for post-editing the output of various machine translation devices.
[0014]
The decision tree creating unit may be configured to create a decision tree of pronoun rewriting rules by referring to a pronoun rewriting example sentence in the case sentence table and using a decision tree learning method which is one of statistical machine learning methods.
According to this configuration, the decision tree of pronoun rewriting rules can be created by decision tree learning, one of the statistical machine learning methods, which makes the decision tree easy to create and maintain.
[0015]
The case sentence table may be configured to store pronoun rewriting example sentences in which each pronoun in a natural language sentence is given one of the following labels: a label indicating that the pronoun is to be deleted from the sentence, a label indicating that the pronoun is to be kept, or a label indicating that the pronoun is to be replaced with another pronoun.
According to this configuration, not only a binary decision on whether to delete or keep the pronoun but also a multi-valued decision that includes replacement with another expression becomes possible, so that pronouns can be rewritten more naturally.
For example, a pronoun can be replaced with another pronoun, which the conventional techniques could not do (for example, replacing 「彼女」 (“she”) in sentence (J2) with the reflexive pronoun 「自分」 (“self”)), so that a higher-quality translation can be generated.
[0016]
According to another aspect of the present invention, there is provided a machine translation device comprising: an input unit for inputting a source language sentence written in natural language; a table memory storing a source language analysis rule table, a source language/target language conversion rule table, an attribute table, and a case sentence table; an analysis and conversion unit that analyzes the source language sentence based on the source language analysis rule table and converts the syntax tree of the source language sentence obtained from the analysis into a syntax tree of the target language based on the source language/target language conversion rule table; an attribute extraction unit that extracts, from the syntax tree of the target language, the morpheme and vocabulary attributes and the syntactic attributes defined in the attribute table; a decision tree creation unit that creates, from the pronoun rewriting example sentences stored in the case sentence table, a decision tree of pronoun rewriting rules composed of morpheme and vocabulary attributes and syntactic attributes expressed in tree form; a rewriting rule determination unit that refers to the decision tree and determines, based on the morpheme and vocabulary attributes and the syntactic attributes extracted from the syntax tree, the rewriting rule corresponding to each pronoun contained in the target language sentence; and a pronoun rewriting unit that rewrites the target pronoun based on the determined rewriting rule.
According to this aspect, in machine translation, tree-form pronoun rewriting rules are created from the pronoun rewriting example sentences stored in a case sentence table prepared in advance by referring not only to morpheme and vocabulary attributes but also to syntactic attributes, so that pronouns can be rewritten more accurately and more natural translation becomes possible.
[0017]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, the present invention will be described in detail based on the embodiments shown in the drawings. The present invention is not limited to these embodiments.
[0018]
FIG. 1 is a block diagram showing a configuration of a pronoun rewriting device that is Embodiment 1 of the present invention. As shown in FIG. 1, the pronoun rewriting device includes a control unit 1, an input unit 2, an output unit 3, a table memory 4, a program memory 5, a buffer memory 6, a bus 7, and a storage medium 8.
[0019]
The control unit 1 is composed of, for example, a CPU, reads the control program from the program memory 5, and controls each unit via the bus 7 according to the control program, thereby realizing the pronoun rewriting function of the present invention.
[0020]
The input unit 2 includes, for example, a keyboard, a tablet, a communication interface that receives data, and the like, and inputs a sentence described in a natural language.
The output unit 3 includes, for example, a display device such as a liquid crystal display, a plasma display, or an EL display, a printing device such as a thermal transfer printer or a laser printer, and a communication interface that transmits data, and outputs the results of processing by the control program stored in the program memory 5.
[0021]
The table memory 4 is composed of, for example, a ROM, an EEPROM, a floppy disk, a hard disk, or a CD-R/W, and has areas that function as: a dictionary table 4a storing morpheme and vocabulary information; a morphological analysis rule table 4b storing rules for morphological analysis of natural language sentences (translations); an attribute table 4c defining the morpheme and vocabulary attributes to be extracted from a translation; and a case sentence table (correct-answer corpus table) 4d storing a set of pronoun rewriting example sentences (a corpus with correct answers) in which the pronouns in translations are given labels indicating the correct rewriting.
[0022]
The program memory 5 stores programs that cause the control unit 1 to function as: a morphological analysis unit 5a that performs morphological analysis of the translation input through the input unit 2 with reference to the dictionary table 4a and the morphological analysis rule table 4b; an attribute extraction unit 5b that extracts, from the morphological analysis result, the morpheme and vocabulary attributes defined in the attribute table 4c; a decision tree creation unit 5c that creates, from the pronoun rewriting example sentences stored in the case sentence table 4d, a decision tree of pronoun rewriting rules composed of morpheme and vocabulary attributes expressed in tree form; a rewriting rule determination unit 5d that refers to the decision tree and determines, based on the extracted morpheme and vocabulary attributes, the rewriting rule corresponding to each pronoun contained in the natural language sentence; and a pronoun rewriting unit 5e that rewrites the target pronoun based on the determined rewriting rule.
[0023]
The buffer memory 6 is composed of, for example, a RAM, an EEPROM, a floppy disk, a hard disk, or a CD-R/W, and has areas that function as: an original sentence buffer 6a that stores the sentence input through the input unit 2; a morphological analysis result buffer 6b that stores the morphological analysis result obtained by the morphological analysis unit 5a; an attribute buffer 6c that stores the morpheme and vocabulary attributes extracted by the attribute extraction unit 5b; a case data buffer 6d that stores pairs consisting of the attributes extracted from a sentence S by the attribute extraction unit 5b and the label, attached to that sentence S in the corpus with correct answers, indicating the correct rewriting of its pronoun; a decision tree buffer 6e that stores the decision tree of pronoun rewriting rules obtained by the decision tree creation unit 5c; a rewriting rule determination result buffer 6f that stores the rewriting rule determined for each pronoun by the rewriting rule determination unit 5d; and a pronoun rewriting result buffer 6g that stores the natural language sentence rewritten by the pronoun rewriting unit 5e.
[0024]
The programs stored in the program memory 5 are broadly divided into 1) a group of programs that automatically create rewriting rules for pronouns based on the pronoun rewriting example sentences (corpus with correct answers), and 2) a group of programs that apply the created pronoun rewriting rules to an input sentence and rewrite the pronouns in it.
In the present invention, pronoun rewriting rules are described in an expression format called a decision tree. The rewriting rule creating program is a program that functions as a decision tree creating unit, and the pronoun rewriting program is a program that functions as a rewriting rule determining unit and a program that functions as a pronoun rewriting unit.
A program that functions as a morphological analysis unit and a program that functions as an attribute extraction unit are used both when creating a pronoun rewriting rule and when pronoun rewriting.
[0025]
[Automatic creation of pronoun rewriting rules]
The morpheme analysis unit 5a performs morpheme analysis on each sentence stored in the case sentence table 4d based on the dictionary table 4a and the morpheme analysis rule table 4b, and stores the result in the morpheme analysis result buffer 6b.
Similarly, the morphological analysis is performed on the input sentence stored in the original sentence buffer 6a.
Morphological analysis is a very well-known general technique, so its description is omitted. It is explained in, for example, the document “Natural Language Processing” (Makoto Nagao, Iwanami Shoten, 1997).
[0026]
The attribute extraction unit 5b extracts the morpheme / vocabulary attributes defined in the attribute table 4c from the morpheme analysis result and stores them in the attribute buffer 6c or the case data buffer 6d.
Attributes are the various properties of the sentences in the corpus with correct answers and of input sentences.
Case data is represented by attributes and a class.
The class corresponds to the label attached to a sentence in the corpus with correct answers. The pronoun rewriting example sentences (corpus with correct answers) are a set of sentences in which each pronoun contained in a translation output by a machine translation device has been manually given a label indicating whether the pronoun should be deleted, left as it is, or replaced with another expression.
[0027]
For example, labels can be attached to sentence (J1) and sentence (J2) as follows.
(J1') 平均的アメリカ人は70歳になるまでに、#D彼は１３トンの牛肉を消費している。
(J2') #J彼女が家を出る前に、メアリーはジョンを叱った。
[0028]
Here, the label “#D” attached to 「彼は」 (“he”) means that 「彼は」 should be deleted, and the label “#J” attached to 「彼女が」 (“she”) means that 「彼女が」 should be replaced with 「自分が」 (“self”).
Although none appears in this example, a pronoun that need not be deleted is given the label “#N”. Note that the types of labels are not limited to these three.
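The labelling scheme above can be illustrated with a minimal Python sketch that reads the labels back out of an annotated sentence. The function name `extract_labels`, the action names, and the two-entry pronoun list are assumptions made for illustration only; the description does not specify how the labelled corpus is parsed.

```python
import re
import unicodedata

# Hypothetical mapping from label letter to the rewriting action it denotes.
LABEL_ACTIONS = {"D": "delete", "J": "replace", "N": "keep"}

# Illustrative pronoun list only; the longer form comes first so that
# "彼女" is not matched as "彼".
PRONOUNS = ("彼女", "彼")

_LABELLED = re.compile(r"#([DJN])(" + "|".join(PRONOUNS) + r")")


def extract_labels(sentence):
    """Return (pronoun, action) pairs for every labelled pronoun in a sentence."""
    # NFKC folds full-width markers such as "＃Ｄ" into ASCII "#D".
    normalised = unicodedata.normalize("NFKC", sentence)
    return [(m.group(2), LABEL_ACTIONS[m.group(1)])
            for m in _LABELLED.finditer(normalised)]


print(extract_labels("#J彼女が家を出る前に、メアリーはジョンを叱った。"))
```

Normalising first lets the same pattern accept both the full-width markers used in the Japanese text and the ASCII markers used here.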
[0029]
FIG. 2 is a diagram illustrating an example of case data stored in the case data buffer of the present embodiment. For example, in the pronoun rewriting example sentences (corpus with correct answers), the pronoun 「彼」 (“he”) currently in focus is given the label “#D”, meaning that it should be deleted, so the class 「消」 (delete) corresponds to the attribute set 「彼」, 「は」, 「有」 (pronoun “he”, particle “wa”, antecedent present) extracted from sentence (J1). Repeating this processing yields the case data shown in FIG. 2, which is stored in the case data buffer 6d.
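One way to picture a single row of the case data is as a record holding the three attributes and the class. The sketch below is only an assumed representation: the type `Case`, the helper `make_case`, and the class name 「換」 used for the “#J” label are hypothetical (the text names only the classes 「消」 and 「残」).

```python
from typing import NamedTuple, Tuple


class Case(NamedTuple):
    notation: str    # notation of the pronoun, e.g. "彼"
    particle: str    # function word attached to the pronoun, e.g. "は"
    antecedent: str  # "有" if a candidate antecedent exists, otherwise "無"
    cls: str         # class derived from the label


# "#D" -> 「消」 and "#N" -> 「残」 follow the description;
# the class name for "#J" is a placeholder chosen for this sketch.
LABEL_TO_CLASS = {"#D": "消", "#N": "残", "#J": "換"}


def make_case(attributes: Tuple[str, str, str], label: str) -> Case:
    """Pair the extracted attributes with the class given by the label."""
    return Case(*attributes, LABEL_TO_CLASS[label])


case_j1 = make_case(("彼", "は", "有"), "#D")
print(case_j1)
```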
[0030]
In the case data shown in FIG. 2, the class is binary: the pronoun is either deleted (「消」) or kept (「残」). However, if the labels attached to pronouns in the corpus with correct answers are multi-valued, this is automatically reflected in the case data. The problem of the conventional technique is therefore solved by assigning appropriate labels when the corpus with correct answers is created.
[0031]
The decision tree creation unit 5c classifies the case data stored in the case data buffer 6d to create a pronoun rewriting rule inductively in the form of a decision tree, and stores the rule in the decision tree buffer 6e.
The decision tree consists of a terminal node representing a class and a non-terminal node (discriminant node) corresponding to a test for examining one attribute.
[0032]
The rewrite rule determination unit 5d searches the decision tree from the root node to the terminal node based on the attribute extracted from the input sentence by the attribute extraction unit 5b according to the test result at the discrimination node, and is included in the input sentence. Rewriting rules for pronouns (deletion (#D), replacement with other expressions (#J), retention as it is (#N), etc.) are determined, and the result is stored in the rewriting rule determination result buffer 6f.
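The traversal performed by the rewriting rule determination unit can be sketched as follows. This is a minimal illustration under stated assumptions: the classes `Leaf` and `Test`, the function `decide`, and the English attribute keys are hypothetical, and the tree is only the fragment of FIG. 4 that the Japanese description of sentence (J3) spells out (the branch for 「彼女」); the other branches of the root are not reproduced in the text and are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    cls: str  # class name stored at a terminal node, e.g. "消"


@dataclass
class Test:
    attribute: str  # attribute examined at this discriminant node
    branches: Dict[str, "Node"] = field(default_factory=dict)


Node = Union[Leaf, Test]


def decide(tree: Node, attributes: Dict[str, str]) -> str:
    """Walk from the root to a terminal node and return its class."""
    node = tree                       # start with the root as the node in focus
    while isinstance(node, Test):     # stop once a terminal node is reached
        value = attributes[node.attribute]
        node = node.branches[value]   # follow the branch labelled with the value
    return node.cls


# Fragment of the tree: only the "彼女" branch is described in the text.
tree = Test("notation", {
    "彼女": Test("antecedent", {"無": Leaf("消"), "有": Leaf("残")}),
})

print(decide(tree, {"notation": "彼女", "particle": "が", "antecedent": "無"}))
```

An attribute value with no matching branch raises `KeyError` in this sketch; how the device handles unseen values is not stated in the text.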
[0033]
The pronoun rewriting unit 5e performs a pronoun rewriting process according to the determination made by the rewriting rule determining unit 5d, and stores the result in the pronoun rewriting result buffer 6g.
Attributes to be extracted from the sentence are defined in advance in the attribute table 4c. For example, assume that three types of attributes are to be extracted: “notation of the pronoun”, “adjunct connected to the pronoun”, and “presence or absence of a preceding noun that the pronoun can refer to”.
[0034]
In this case, when the attribute extraction unit 5b processes the morphological analysis result of sentence (J1) according to the attribute table 4c, the values “he”, “ha”, and “present” are extracted. Here, “present” means that a preceding noun the pronoun can refer to, namely “average American”, exists.
Further, the decision tree creation unit 5c creates a decision tree from the case data according to a method called C4.5, described in the document “Data analysis by AI” (J. R. Quinlan, translated by Koichi Furukawa, Toppan, 1995).
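The attribute extraction described above for sentence (J1) can be sketched as follows. This is an illustration under stated assumptions, not the method of the attribute extraction unit 5b: the `(surface, part of speech)` morpheme format, the tag names, and the rule "any noun before the pronoun counts as a possible antecedent" are all hypothetical, since this section does not specify how the presence of a preceding noun is decided. The surface forms 彼 and は correspond to the values “he” and “ha” in the text.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (surface form, part of speech); format is assumed

def extract_attributes(morphemes: List[Morpheme]) -> List[Tuple[str, str, str]]:
    """Return (notation, adjunct, antecedent) for every pronoun in the sentence."""
    results = []
    for i, (surface, pos) in enumerate(morphemes):
        if pos != "pronoun":
            continue
        # Adjunct: the particle immediately following the pronoun, if any.
        adjunct = ""
        if i + 1 < len(morphemes) and morphemes[i + 1][1] == "particle":
            adjunct = morphemes[i + 1][0]
        # Crude stand-in (assumption): any earlier noun is a possible antecedent.
        has_noun = any(p == "noun" for _, p in morphemes[:i])
        results.append((surface, adjunct, "present" if has_noun else "none"))
    return results

# Hand-written analysis of the beginning of sentence (J1); the tags are assumptions.
j1 = [("平均的", "adjective"), ("アメリカ人", "noun"), ("は", "particle"),
      ("70歳", "noun"), ("に", "particle"), ("なる", "verb"),
      ("まで", "particle"), ("に", "particle"), ("、", "symbol"),
      ("彼", "pronoun"), ("は", "particle")]

attributes_j1 = extract_attributes(j1)
```

A real system would take the morphemes from the morphological analysis result buffer 6b and would likely need a stricter test for whether a preceding noun is a valid referent.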
[0035]
The decision tree according to C4.5 is created by sequentially selecting a test X for dividing the case set T into n subsets according to the gain criterion.
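The gain criterion selects the test X that maximizes the gain, that is, the difference between info(T) and info_X(T) expressed by the following equations:
$$\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T)$$
[0036]
$$\mathrm{info}(T) = -\sum_{j=1}^{k} \frac{\mathrm{freq}(C_j, T)}{|T|} \log_2 \frac{\mathrm{freq}(C_j, T)}{|T|}$$
$$\mathrm{info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|}\,\mathrm{info}(T_i)$$
Here, freq(C_j, T) is the number of cases in the case set T that belong to class C_j, |T| is the number of all cases included in T, k is the number of classes, and T_1, ..., T_n are the subsets into which the test X divides T.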
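The gain criterion above can be written compactly in Python. This is a sketch of the plain information gain only; the representation of a case as an `(attributes dict, class label)` pair and the function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def info(cases):
    """info(T): entropy of the class distribution in the case set T."""
    total = len(cases)
    counts = Counter(label for _, label in cases)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def partition(cases, attr):
    """Divide T into subsets T_1..T_n by the value of one attribute (the test X)."""
    subsets = defaultdict(list)
    for attrs, label in cases:
        subsets[attrs[attr]].append((attrs, label))
    return subsets

def info_x(cases, attr):
    """info_X(T): size-weighted entropy of the subsets produced by the test."""
    total = len(cases)
    return sum(len(s) / total * info(s) for s in partition(cases, attr).values())

def gain(cases, attr):
    """gain(X) = info(T) - info_X(T)."""
    return info(cases) - info_x(cases, attr)
```

An attribute whose values separate the classes perfectly has a gain equal to info(T), while an attribute with the same value in every case has a gain of zero.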
[0037]
FIG. 3 is a flowchart showing the decision tree creation procedure for the pronoun rewriting rules of this embodiment. The decision tree is created according to C4.5.
In step S101, all cases are assigned to the root node.
In step S103, the test X that maximizes gain(X) for the case set T is selected.
[0038]
When the case data shown in FIG. 2 is given, gain(X) is calculated as follows. In the case data of FIG. 2, there are 9 cases of class “erased” and 5 cases of class “remaining”, so
$$\mathrm{info}(T) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$
[0039]
In step S104, the test X is used to divide the case set into subsets, and a decision tree is grown using each subset as a node.
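When the case set is divided into three subsets according to the value of the attribute “notation”, the information after the division is
$$\mathrm{info}_{\mathrm{notation}}(T) = 0.694$$
Therefore, the gain obtained by the division by the test based on the attribute “notation” is
$$\mathrm{gain}(\mathrm{notation}) = \mathrm{info}(T) - \mathrm{info}_{\mathrm{notation}}(T) = 0.940 - 0.694 = 0.246$$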
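The arithmetic above can be checked numerically. The 9/5 class counts come from the text, but the per-subset counts below are an assumption: FIG. 2 is not reproduced here, so `assumed_subsets` is merely one hypothetical three-way split of the 14 cases that is consistent with the stated value 0.694.

```python
import math

def info_from_counts(counts):
    """Entropy of a class distribution given as a list of per-class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# From the text: 9 cases of class "erased" and 5 of class "remaining".
info_t = info_from_counts([9, 5])

# ASSUMPTION: hypothetical (erased, remaining) counts for the three values
# of the attribute "notation"; the real distribution is in FIG. 2.
assumed_subsets = [(2, 3), (4, 0), (3, 2)]
total = sum(sum(s) for s in assumed_subsets)
info_notation = sum(sum(s) / total * info_from_counts(s) for s in assumed_subsets)

gain_notation = info_t - info_notation
print(round(info_t, 3), round(info_notation, 3))
```

Under this assumed split the unrounded gain is about 0.2467; the value 0.246 in the text results from subtracting the already rounded figures 0.940 and 0.694.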
[0040]
When the gains obtained by dividing the case data collection according to the value of the attribute “adjunct” and the value of the attribute “preceding (pronoun) noun” are calculated in the same manner, neither exceeds the gain for the attribute “notation”. Therefore, the test for the attribute “notation” is selected as the test to be performed at the root node.
[0041]
If the same process is continued until the end condition is satisfied in step S102, a final decision tree is created.
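The loop of steps S101 to S104 can be sketched as a recursive procedure. This is a simplified, ID3-style illustration that uses the plain gain criterion described above; it omits other parts of C4.5 such as pruning. The end condition used here (a subset with a single class, or no attributes left) is an assumption, as are the nested-dict tree format and the toy case data.

```python
import math
from collections import Counter, defaultdict

def info(cases):
    total = len(cases)
    counts = Counter(label for _, label in cases)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def partition(cases, attr):
    subsets = defaultdict(list)
    for attrs, label in cases:
        subsets[attrs[attr]].append((attrs, label))
    return subsets

def gain(cases, attr):
    total = len(cases)
    after = sum(len(s) / total * info(s) for s in partition(cases, attr).values())
    return info(cases) - after

def build_tree(cases, attrs):
    """Grow a decision tree; terminal nodes are class names, others are dicts."""
    labels = Counter(label for _, label in cases)
    # Assumed end condition (S102): one class only, or no attribute left to test.
    if len(labels) == 1 or not attrs:
        return labels.most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(cases, a))      # S103: maximize gain
    rest = [a for a in attrs if a != best]
    return {"test": best,                                 # S104: split and recurse
            "branches": {value: build_tree(subset, rest)
                         for value, subset in partition(cases, best).items()}}

# Toy case data (hypothetical, not FIG. 2).
toy = [({"notation": "she", "antecedent": "none"}, "erased"),
       ({"notation": "she", "antecedent": "present"}, "remaining"),
       ({"notation": "she", "antecedent": "present"}, "remaining"),
       ({"notation": "he", "antecedent": "none"}, "erased"),
       ({"notation": "he", "antecedent": "present"}, "erased"),
       ({"notation": "he", "antecedent": "present"}, "erased")]
tree = build_tree(toy, ["notation", "antecedent"])
```

For the toy data, "notation" has the larger gain and becomes the root test, and the "she" branch is then split again on "antecedent".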
[0042]
FIG. 4 is a diagram illustrating an example of a decision tree of pronoun rewriting rules created according to the present embodiment. The decision tree shown in FIG. 4 is created from the case data collection shown in FIG. 2 and stored in the decision tree buffer 6e. Each non-terminal node of the decision tree holds the attribute corresponding to its test, each terminal node holds a class name, and each branch is assigned an attribute value.
[0043]
[Application of pronoun rewriting rules]
The process of rewriting pronouns by applying, to an input sentence, the decision tree of pronoun rewriting rules automatically created from the case data collection by the above method is executed by the morphological analysis unit 5a, the attribute extraction unit 5b, the rewrite rule determination unit 5d, and the pronoun rewriting unit 5e.
The morphological analysis unit 5a and the attribute extraction unit 5b perform the same processing on the input sentence as the processing performed on each sentence in the corpus with correct answers. Thereby, the attribute is extracted from the input sentence and stored in the attribute buffer 6c.
For example, assume that the next sentence (J3) is an input sentence.
[0044]
(E3) She went home because she didn't feel well.
(J3) She returned home because she was not feeling well.
In sentence (J3), the pronoun “she” appears twice, so the attribute values
Notation of the pronoun = she
Adjunct connected to the pronoun = ha
Presence or absence of a preceding noun that the pronoun can refer to = none
and the attribute values
Notation of the pronoun = she
Adjunct connected to the pronoun =
Presence or absence of a preceding noun that the pronoun can refer to = present
are obtained.
[0045]
The rewrite rule determination unit 5d traverses the decision tree from the root node to a terminal node, following the test result at each discrimination node based on the attributes extracted from the input sentence by the attribute extraction unit 5b.
[0046]
FIG. 5 is a flowchart showing the rewrite rule determination procedure of this embodiment.
In the case of the first attribute value extracted from the sentence (J3), the process of following the decision tree of FIG. 4 is as follows.
First, in step S201, the root node is set as the node of interest.
Since the current node of interest is not a terminal node, the process goes to step S203 via step S202.
In step S203, since the attribute (test) at the node of interest is “notation” and the value of the attribute “notation” of the input sentence is “she”, the branch labeled “she” is followed, and the node “preceding (pronoun) noun” becomes the next node of interest.
[0047]
Returning to step S203 via step S202, the test of “preceding (pronoun) noun” is performed at the current node of interest. Since the value of the attribute “preceding (pronoun) noun” of the input sentence is “none”, the branch labeled “none” is followed, and the terminal node “erased” becomes the next node of interest.
In step S202, since the current node of interest is a terminal node, the traversal is terminated, and the class name “erased” described in the terminal node that has been reached is stored in the rewrite rule determination result buffer 6f together with the target pronoun.
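The traversal in steps S201 to S203 amounts to a short loop. In the sketch below, the nested-dict tree format and the function name `determine_rule` are assumptions, and `TREE` is only a fragment in the spirit of FIG. 4: it contains the one path walked through in the text, plus a "present" branch whose direct connection to a terminal node is assumed.

```python
# Fragment of a decision tree; only the "she" path discussed in the text.
# The "present" -> "remaining" terminal is an assumption about FIG. 4.
TREE = {
    "test": "notation",
    "branches": {
        "she": {
            "test": "preceding (pronoun) noun",
            "branches": {"none": "erased", "present": "remaining"},
        },
    },
}

def determine_rule(tree, attributes):
    """Walk from the root to a terminal node and return its class name."""
    node = tree                                  # S201: root is the node of interest
    while isinstance(node, dict):                # S202: stop at a terminal node
        value = attributes[node["test"]]         # S203: run the test at this node
        node = node["branches"][value]           #        and follow the matching branch
    return node

first = {"notation": "she", "adjunct": "ha", "preceding (pronoun) noun": "none"}
rule_for_first = determine_rule(TREE, first)
```

An attribute value with no matching branch raises `KeyError` in this sketch; how the device handles values that never occurred in the case data is not stated in this section.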
[0048]
Similarly, in the case of the second attribute value, the decision tree is traversed, and the class name “remaining” is stored in the rewrite rule determination result buffer 6f together with the target pronoun.
The pronoun rewriting unit 5e rewrites each target pronoun according to the target pronoun stored in the rewrite rule determination result buffer 6f and the class name corresponding to it, that is, the type of rewriting.
Through the above processing, the phrase “she is” (the pronoun together with its adjunct) is deleted from sentence (J3), and the following sentence is stored in the pronoun rewriting result buffer 6g.
(J3 ') She went home because she wasn't feeling well.
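The deletion step can be sketched on a token list. Only the English gloss of (J3) is available here, so the Japanese tokens below are an assumed rendering used purely for illustration; the function name, the `{token index: class name}` decision format, and the rule that exactly one adjunct token follows an erased pronoun are likewise assumptions.

```python
def rewrite_pronouns(tokens, decisions):
    """Apply class names to pronouns; decisions maps token index -> class name."""
    out = []
    skip = 0
    for i, tok in enumerate(tokens):
        if skip:
            skip -= 1            # drop the adjunct that followed an erased pronoun
            continue
        if decisions.get(i) == "erased":
            skip = 1             # assumed: one adjunct token follows the pronoun
            continue
        out.append(tok)          # "remaining" pronouns and all other tokens are kept
    return out

# Assumed Japanese rendering of (J3), tokenized by hand.
j3 = ["彼女", "は", "気分", "が", "よく", "なかっ", "た", "ので", "、",
      "彼女", "は", "家", "に", "帰っ", "た", "。"]

# First pronoun (index 0): "erased"; second pronoun (index 9): "remaining".
j3_rewritten = "".join(rewrite_pronouns(j3, {0: "erased", 9: "remaining"}))
```

Handling the replacement label (#J) would need an additional branch that substitutes another expression for the pronoun instead of skipping it.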
[0049]
FIG. 6 is a block diagram showing the configuration of a machine translation apparatus that is Embodiment 2 of the present invention. The machine translation apparatus according to the second embodiment includes a control unit 1, an input unit 2, an output unit 3, a table memory 4, a program memory 5, a buffer memory 6, a bus 7, and a storage medium 8.
Since the control unit 1, the input unit 2, and the output unit 3 are the same as the corresponding units in the pronoun rewriting device of the first embodiment, their description is omitted.
[0050]
The table memory 4 includes areas that function as a source language analysis rule table 4a′ that stores rules for analyzing a source language sentence, a source language/target language conversion rule table 4b′ that stores rules for converting the syntax tree of a source language sentence into a syntax tree of the target language, an attribute table 4c that defines the morpheme-level attributes and syntactic attributes to be extracted from the syntax tree, and an example sentence table (corpus table with correct answers) 4d that stores a set of sentences (corpus with correct answers) in which each pronoun is given a label indicating the correct rewriting.
[0051]
The program memory 5 stores programs that function as an analysis/conversion unit 5a′, an attribute extraction unit 5b, a decision tree creation unit 5c, a rewrite rule determination unit 5d, and a pronoun rewriting unit 5e.
[0052]
The analysis/conversion unit 5a′ analyzes the sentence stored in the source text buffer 6a based on the source language analysis rule table 4a′, and converts the resulting syntax tree of the source language sentence into a syntax tree of the target language based on the source language/target language conversion rule table 4b′.
The attribute extraction unit 5b extracts the syntax attribute defined in the attribute table 4c from the syntax tree.
The decision tree creation unit 5c and the rewrite rule determination unit 5d are the same as the decision tree creation unit 5c and the rewrite rule determination unit 5d in the apparatus of the first embodiment.
The pronoun rewriting unit 5e includes the function of the pronoun rewriting unit in the apparatus of the first embodiment.
[0053]
The buffer memory 6 includes areas that function as a source text buffer 6a, a syntax tree buffer 6b ′, an attribute buffer 6c, a case data buffer 6d, a decision tree buffer 6e, a rewrite rule determination result buffer 6f, and a pronoun rewrite result buffer 6g.
[0054]
The source text buffer 6a, attribute buffer 6c, case data buffer 6d, decision tree buffer 6e, and rewrite rule determination result buffer 6f are the same as those in the apparatus of the first embodiment.
The syntax tree buffer 6b ′ stores the syntax tree analyzed and converted by the analysis / conversion unit 5a ′.
The pronoun rewriting result buffer 6g stores the sentence created by the pronoun rewriting unit 5e.
[0055]
The main difference between the pronoun rewriting device of the first embodiment and the machine translation device of the second embodiment is the nature of the attributes that can be extracted by the attribute extraction unit.
That is, the attributes extracted by the attribute extraction unit in the first embodiment are at the morpheme/vocabulary level, whereas the attributes extracted by the attribute extraction unit in the second embodiment are at both the morpheme/vocabulary level and the syntax level.
[0056]
Examples of attributes extracted by the attribute extraction unit in the second embodiment include “information on the word that the pronoun depends on” (the notation, part of speech, and semantic code of that word) and “information on the clause containing the pronoun” (main clause or subordinate clause).
[0057]
For example, the following syntax level attributes can be extracted from the sentence (J1).
Notation of the word that the pronoun depends on = consume
Part of speech of the word that the pronoun depends on = verb
Semantic code of the word that the pronoun depends on = 1371007
Clause containing the pronoun = main clause
[0058]
The syntax level attributes extracted from the sentence (J2) are as follows.
Notation of the word that the pronoun depends on = leave
Part of speech of the word that the pronoun depends on = verb
Semantic code of the word that the pronoun depends on = 2121001
Clause containing the pronoun = subordinate clause
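As a sketch of how such syntax-level attributes might be read off a syntax tree, the snippet below models each word as a node that points to the word it depends on. The `Node` class, its field names, and the attribute keys are assumptions for illustration; the values are those listed above for sentence (J1), with the head word written in its Japanese form 消費 as it appears in (J1).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A word in a (hypothetical) dependency-style syntax tree."""
    surface: str
    pos: str
    semantic_code: Optional[str] = None
    clause: Optional[str] = None       # "main clause" or "subordinate clause"
    head: Optional["Node"] = None      # the word this word depends on

def syntax_attributes(pronoun: Node) -> dict:
    """Collect the syntax-level attributes of one pronoun from its head word."""
    head = pronoun.head
    return {
        "head notation": head.surface,
        "head part of speech": head.pos,
        "head semantic code": head.semantic_code,
        "clause": pronoun.clause,
    }

# Sentence (J1): the pronoun depends on the verb 消費 in the main clause.
verb = Node("消費", "verb", semantic_code="1371007", clause="main clause")
he = Node("彼", "pronoun", clause="main clause", head=verb)
attrs_j1 = syntax_attributes(he)
```

These values would be merged with the morpheme/vocabulary-level attributes to form one attribute set per pronoun, giving the decision tree more tests to choose from.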
[0059]
There are a variety of factors that determine how pronouns should be rewritten. Thus, in order to improve the accuracy of pronoun rewriting, it is necessary to refer to as many types of attributes as possible. In this regard, in the apparatus of the second embodiment, not only morpheme / vocabulary level attributes but also syntax level attributes are referenced, so that more accurate pronoun rewriting can be realized.
[0060]
FIG. 7 is a diagram illustrating the relationship between the pronoun rewriting device of the first embodiment and the machine translation device of the second embodiment. As shown in FIG. 7, the pronoun rewriting device according to the first embodiment is intended to suppress the creation of unnatural target language sentences caused by differences in language habits between the source language and the target language in machine translation. It is not incorporated into a specific machine translation device and can be applied to the post-editing of machine translation.
[0061]
Therefore, the first embodiment is a pronoun rewriting device used for post-editing after machine translation, and has the advantage of high independence from the machine translation device. However, to maintain this independence, the attributes that can be referenced are limited to the morpheme/vocabulary level, so its accuracy is not expected to exceed that of pronoun rewriting by the machine translation apparatus of the second embodiment.
[0062]
The machine translation apparatus according to the second embodiment incorporates the pronoun rewriting process into the machine translation apparatus as one function of the machine translation process. Because it additionally refers to syntactic attributes, the accuracy of pronoun rewriting by the apparatus of the second embodiment can be expected to be higher than that of the apparatus of the first embodiment.
[0063]
The pronoun rewriting device or machine translation device described above is realized by a program that causes a computer to function as the device.
The subject of the invention may be the program itself or a program stored in a computer-readable storage medium 8.
First, in the present invention, the storage medium 8 may be a memory required for processing by the microcomputer of the control unit 1 shown in FIG. 1 or FIG. 6; for example, a ROM may itself be the program medium. Alternatively, although not shown, a program reading device may be provided as an external storage device, and the program medium may be a storage medium that is read by being inserted into that device. In either case, the stored program may be accessed and executed directly by the microcomputer, or the program may be read out, loaded into a program storage area (not shown) of the microcomputer, and then executed. The program used for this loading is assumed to be stored in the main device in advance.
[0064]
Here, the program medium is a storage medium configured to be separable from the main body, and may be a medium that carries the program in a fixed manner, including tape systems such as magnetic tapes and cassette tapes; disk systems including magnetic disks such as floppy disks and hard disks and optical disks such as CD-ROM, MO, MD, and DVD; card systems such as IC cards (including memory cards) and optical cards; and semiconductor memories such as mask ROM, EPROM, EEPROM, and flash ROM.
Since the system of the present invention is configured to be connectable to a communication network including the Internet, the medium may also carry the program in a fluid manner, such that the program is downloaded from the communication network. When the program is downloaded from the communication network in this way, the program used for downloading may be stored in the apparatus main body in advance or may be installed from another storage medium.
The content stored in the storage medium is not limited to a program, and may be data.
[0065]
Next, in the present invention, the program itself may be the processing executed by the microcomputer of the control unit 1 shown in FIG. 1, may be one that is taken in by accessing a communication network including the Internet, or may be one that is sent out from here. Further, it may be generated as a result of processing in the pronoun rewriting device or the machine translation device based on the program taken in, or generated as a result of such processing when it is sent out from here.
These are not limited to programs and may be data.
[0066]
[Effect of the invention]
According to the present invention, pronoun rewriting rules composed of morpheme/vocabulary attributes expressed in tree form are created from the pronoun rewriting example sentences stored in a case sentence table prepared in advance, so that pronouns can be rewritten into more natural expressions in the post-processing of machine translation of natural language sentences.
Thereby, the independence from a specific machine translation apparatus is high, and it can be generally used for post-editing of various machine translation apparatuses.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a pronoun rewriting device according to a first embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of case data stored in a case data buffer according to the embodiment.
FIG. 3 is a flowchart showing a procedure for creating a decision tree for pronoun rewriting rules according to the embodiment.
FIG. 4 is a diagram illustrating an example of a decision tree of pronoun rewriting rules created according to the embodiment.
FIG. 5 is a flowchart showing a rewrite rule determination procedure according to the embodiment.
FIG. 6 is a block diagram illustrating a configuration of a machine translation apparatus that is Embodiment 2 of the present invention.
FIG. 7 is a diagram illustrating a relationship between the pronoun rewriting device according to the first embodiment and the machine translation device according to the second embodiment.
[Explanation of symbols]
1 Control unit
2 Input unit
3 Output unit
4 Table memory
4a Dictionary table
4a′ Source language analysis rule table
4b Morphological analysis rule table
4b′ Source language/target language conversion rule table
4c Attribute table
4d Example sentence table (corpus table with correct answers)
5 Program memory
5a Morphological analysis unit
5a′ Analysis/conversion unit
5b Attribute extraction unit
5c Decision tree creation unit
5d Rewrite rule determination unit
5e Pronoun rewriting unit
6 Buffer memory
6a Source text buffer
6b Morphological analysis result buffer
6b′ Syntax tree buffer
6c Attribute buffer
6d Case data buffer
6e Decision tree buffer
6f Rewrite rule determination result buffer
6g Pronoun rewriting result buffer
7 Bus
8 Storage medium

Claims

A pronoun rewriting device comprising:
an input unit for inputting a natural language sentence;
a table memory storing a dictionary table that stores morpheme/vocabulary information, a morphological analysis rule table that stores rules for morphological analysis of natural language sentences, an attribute table that defines morpheme/vocabulary attributes to be extracted from natural language sentences, and a case sentence table that stores a set of pronoun rewriting case sentences in which pronouns in natural language sentences are given labels indicating the correct rewriting;
a morphological analysis unit that performs morphological analysis of the input natural language sentence with reference to the dictionary table and the morphological analysis rule table;
an attribute extraction unit that extracts the morpheme/vocabulary attributes defined in the attribute table from the morphological analysis result of the morphological analysis unit;
a decision tree creation unit that creates, from the pronoun rewriting case sentences stored in the case sentence table, a decision tree of pronoun rewriting rules composed of morpheme/vocabulary attributes expressed in tree form;
a rewrite rule determination unit that refers to the decision tree created by the decision tree creation unit and determines, based on the extracted morpheme/vocabulary attributes, the rewrite rule corresponding to each pronoun included in the natural language sentence; and
a pronoun rewriting unit that rewrites the target pronoun based on the rewrite rule determined by the rewrite rule determination unit.

The pronoun rewriting device according to claim 1, wherein the decision tree creation unit refers to the pronoun rewriting case sentences stored in advance in the case sentence table and creates the decision tree of pronoun rewriting rules by a decision tree learning method, which is one of the statistical machine learning methods.

The pronoun rewriting device according to claim 1, wherein the case sentence table stores pronoun rewriting case sentences in which each pronoun in a natural language sentence is given one of a label indicating a process of deleting the pronoun from the sentence, a label indicating a process of leaving the pronoun, and a label indicating a process of replacing the pronoun with another pronoun.

A pronoun rewriting method, wherein:
a natural language sentence is input from an input unit;
a table memory stores a dictionary table that stores morpheme/vocabulary information, a morphological analysis rule table that stores rules for morphological analysis of natural language sentences, an attribute table that defines morpheme/vocabulary attributes to be extracted from natural language sentences, and a case sentence table that stores a set of pronoun rewriting case sentences in which pronouns in natural language sentences are given labels indicating the correct rewriting;
a morphological analysis unit performs morphological analysis of the input natural language sentence with reference to the dictionary table and the morphological analysis rule table;
an attribute extraction unit extracts the morpheme/vocabulary attributes defined in the attribute table from the morphological analysis result of the morphological analysis unit;
a decision tree creation unit creates, from the pronoun rewriting case sentences stored in the case sentence table, a decision tree of pronoun rewriting rules composed of morpheme/vocabulary attributes expressed in tree form;
a rewrite rule determination unit refers to the decision tree created by the decision tree creation unit and determines, based on the extracted morpheme/vocabulary attributes, the rewrite rule corresponding to each pronoun included in the natural language sentence; and
a pronoun rewriting unit rewrites the target pronoun based on the rewrite rule determined by the rewrite rule determination unit.

The pronoun rewriting method according to claim 4, wherein the decision tree creation unit refers to the pronoun rewriting case sentences stored in the case sentence table and creates the decision tree of pronoun rewriting rules by a decision tree learning method, which is one of the statistical machine learning methods.

The pronoun rewriting method according to claim 4, wherein the case sentence table stores pronoun rewriting case sentences in which each pronoun in a natural language sentence is given one of a label indicating a process of deleting the pronoun from the sentence, a label indicating a process of leaving the pronoun, and a label indicating a process of replacing the pronoun with another pronoun.

A machine translation apparatus comprising:
an input unit for inputting a source language sentence written in a natural language;
a table memory storing a source language analysis rule table that stores rules for analyzing source language sentences, a source language/target language conversion rule table that stores rules for converting the syntax tree of a source language sentence into a syntax tree of the target language, an attribute table that defines the morpheme-level attributes and syntactic attributes to be extracted from the syntax tree, and a case sentence table that stores a set of pronoun rewriting case sentences in which pronouns in the sentences are given labels indicating the correct rewriting;
an analysis/conversion unit that analyzes the source language sentence based on the source language analysis rule table and converts the syntax tree of the source language sentence obtained from the analysis result into a syntax tree of the target language based on the source language/target language conversion rule table;
an attribute extraction unit that extracts the morpheme/vocabulary attributes and syntactic attributes defined in the attribute table from the syntax tree of the target language converted by the analysis/conversion unit;
a decision tree creation unit that creates, from the pronoun rewriting case sentences stored in the case sentence table, a decision tree of pronoun rewriting rules composed of morpheme/vocabulary attributes and syntactic attributes expressed in tree form;
a rewrite rule determination unit that refers to the decision tree created by the decision tree creation unit and determines, based on the morpheme/vocabulary attributes and syntactic attributes extracted from the syntax tree, the rewrite rule corresponding to each pronoun included in the target language sentence; and
a pronoun rewriting unit that rewrites the target pronoun based on the rewrite rule determined by the rewrite rule determination unit.

The machine translation apparatus according to claim 7, wherein the decision tree creation unit refers to the pronoun rewriting case sentences stored in the case sentence table and creates the decision tree of pronoun rewriting rules by a decision tree learning method, which is one of the statistical machine learning methods.

The machine translation apparatus according to claim 7, wherein the case sentence table stores pronoun rewriting case sentences in which each pronoun in a natural language sentence is given one of a label indicating a process of deleting the pronoun from the sentence, a label indicating a process of leaving the pronoun, and a label indicating a process of replacing the pronoun with another pronoun.

A machine translation method, wherein:
a source language sentence written in a natural language is input from an input unit;
a table memory stores a source language analysis rule table that stores rules for analyzing source language sentences, a source language/target language conversion rule table that stores rules for converting the syntax tree of a source language sentence into a syntax tree of the target language, an attribute table that defines the morpheme-level attributes and syntactic attributes to be extracted from the syntax tree, and a case sentence table that stores a set of pronoun rewriting case sentences in which pronouns in the sentences are given labels indicating the correct rewriting;
an analysis/conversion unit analyzes the source language sentence based on the source language analysis rule table and converts the syntax tree of the source language sentence obtained from the analysis result into a syntax tree of the target language based on the source language/target language conversion rule table;
an attribute extraction unit extracts the morpheme/vocabulary attributes and syntactic attributes defined in the attribute table from the syntax tree of the target language converted by the analysis/conversion unit;
a decision tree creation unit creates, from the pronoun rewriting case sentences stored in the case sentence table, a decision tree of pronoun rewriting rules composed of morpheme/vocabulary attributes and syntactic attributes expressed in tree form;
a rewrite rule determination unit refers to the decision tree created by the decision tree creation unit and determines, based on the morpheme/vocabulary attributes and syntactic attributes extracted from the syntax tree, the rewrite rule corresponding to each pronoun included in the target language sentence; and
a pronoun rewriting unit rewrites the target pronoun based on the rewrite rule determined by the rewrite rule determination unit.

A pronoun rewriting program for causing a computer to realize:
a function of inputting a natural language sentence from an input unit;
a function of storing, in a table memory, a dictionary table that stores morpheme/vocabulary information, a morphological analysis rule table that stores rules for morphological analysis of natural language sentences, an attribute table that defines morpheme/vocabulary attributes to be extracted from natural language sentences, and a case sentence table that stores a set of pronoun rewriting case sentences in which pronouns in natural language sentences are given labels indicating the correct rewriting;
a function by which a morphological analysis unit performs morphological analysis of the input natural language sentence with reference to the dictionary table and the morphological analysis rule table;
a function by which an attribute extraction unit extracts the morpheme/vocabulary attributes defined in the attribute table from the morphological analysis result of the morphological analysis unit;
a function by which a decision tree creation unit creates, from the pronoun rewriting case sentences stored in the case sentence table, a decision tree of pronoun rewriting rules composed of morpheme/vocabulary attributes expressed in tree form;
a function by which a rewrite rule determination unit refers to the decision tree created by the decision tree creation unit and determines, based on the extracted morpheme/vocabulary attributes, the rewrite rule corresponding to each pronoun included in the natural language sentence; and
a function by which a pronoun rewriting unit rewrites the target pronoun based on the rewrite rule determined by the rewrite rule determination unit.

A machine translation program for causing a computer to realize:
a function of inputting, from an input unit, a source language sentence written in a natural language;
a function of storing, in a table memory, a source language analysis rule table that stores rules for analyzing source language sentences, a source language/target language conversion rule table that stores rules for converting the syntax tree of a source language sentence into a syntax tree of the target language, an attribute table that defines the morpheme-level attributes and syntactic attributes to be extracted from the syntax tree, and a case sentence table that stores a set of pronoun rewriting case sentences in which pronouns in the sentences are given labels indicating the correct rewriting;
a function by which an analysis/conversion unit analyzes the source language sentence based on the source language analysis rule table and converts the syntax tree of the source language sentence obtained from the analysis result into a syntax tree of the target language based on the source language/target language conversion rule table;
a function by which an attribute extraction unit extracts the morpheme/vocabulary attributes and syntactic attributes defined in the attribute table from the syntax tree of the target language;
a function by which a decision tree creation unit creates, from the pronoun rewriting case sentences stored in the case sentence table, a decision tree of pronoun rewriting rules composed of morpheme/vocabulary attributes and syntactic attributes expressed in tree form;
a function by which a rewrite rule determination unit refers to the decision tree created by the decision tree creation unit and determines, based on the morpheme/vocabulary attributes and syntactic attributes extracted from the syntax tree, the rewrite rule corresponding to each pronoun included in the target language sentence; and
a function by which a pronoun rewriting unit rewrites the target pronoun based on the rewrite rule determined by the rewrite rule determination unit.

A computer-readable storage medium storing the program according to claim 12 or 13.