JP4573432B2

JP4573432B2 - Word segmentation method for Chinese text

Info

Publication number: JP4573432B2
Application number: JP2000531795A
Authority: JP
Inventors: ウー，アンディ; リチャードソン，スティーヴン・ディー; チャン，チーシン
Original assignee: Microsoft Corp
Current assignee: Microsoft Corp
Priority date: 1998-02-13
Filing date: 1999-01-13
Publication date: 2010-11-04
Anticipated expiration: 2019-01-13
Also published as: WO1999041680A2; JP5100770B2; CN1114165C; WO1999041680A3; CN1290371A; JP2002503849A; JP2010157260A; EP1055182A2

Description

【０００１】
（技術分野）
本発明は、一般的に、自然言語処理の分野に関し、更に特定すれば単語区分（word segmentation）の分野に関するものである。
【０００２】
（発明の背景）
単語区分とは、文のような言語の表現を構成する個々の単語を識別するプロセスのことである。単語の区分は、綴りや文法をチェックしたり、文から音声を合成したり、自然言語の解析や理解を行なったりする際に有用である。これらは全て、個々の単語の識別によって得られる効果である。
【０００３】
英文の場合、単語の区分を行なうのはむしろ単純である。即ち、空間や句読点符号が、文内の個々の単語を区切っているからである。以下の表１における英文を考えてみる。
【０００４】
【表１】

【０００５】
隣接する一連の空間および／または一連の空間に先立つ単語の末端としての句読点符号を識別することによって、表１の英文は、以下の表２に示すように単純に区分することができる。
【０００６】
【表２】

【０００７】
中国語の文では、単語の境界は、明示的ではなくむしろ暗示的である。以下の表３における文章を考えてみる。これは、“委員会はこの問題を昨日の午後ブエノス・アイレスで論じた”という意味である。
【０００８】
【表３】

【０００９】
文章には句読点や空間がないにも拘らず、中国語の読者であれば、表３の文章を、以下の表４において下線を引いて区別した単語から成るものとして認識する。
【００１０】
【表４】

【００１１】
上述の例から、中国語の単語区分は、英語の単語区分と同様にはできないことがわかる。したがって、中国語の区分を自動的に行なう高精度で効率的な手法があれば、大きな有用性を有するであろう。
【００１２】
（発明の概要）
本発明によれば、単語区分ソフトウエア・ファシリティ（“ファシリティ”）が、中国語のような非区分言語における文の単語区分操作を行なう際に、（１）入力文章における文字の可能な組み合わせを評価して、入力文章内の単語を表す可能性がないものを破棄し、（２）辞書において残りの文字の組み合わせを調べ、これらが単語を構成できるか否か判定し、（３）単語であると判定した文字の組み合わせを、入力文章を表す代替語彙レコードとして、自然言語パーザに提出する。パーザは、入力文章の構文構造を表す構文解析ツリーを生成する。これは、入力文章における単語であることが証明された文字の組み合わせを表す語彙レコードのみを含む。語彙レコードをパーザに提出する際、ファシリティは、語彙レコードに重み付けを行い、パーザが長い文字の組み合わせを、短い文字の組み合わせよりも前に検討するようにする。何故なら、一般に、長い文字の組み合わせの方が、短い文字の組み合わせよりも文章の正しい区分を表す場合が多いからである。
【００１３】
入力文章における単語を表す可能性が低い文字の組み合わせを容易に破棄するために、ファシリティは、辞書内で現れる文字毎に、（１）単語長および単語が現れる文字位置の異なる組み合わせ全ての指示、および（２）この文字が単語を開始するときに、この文字の後に続く可能性がある文字全ての指示を、辞書に追加する。更に、ファシリティは、（３）多文字単語内に部分単語が存在可能で検討すべきか否かについて、多文字単語に対する指示も追加する。文章を処理する際、ファシリティは、いずれかの単語が辞書内にない単語長／位置の組み合わせで用いられている文字の組み合わせ、および（２）２番目の文字が最初の文字に可能な第２文字としてリストされていない文字の組み合わせを破棄する。更に、ファシリティは、（３）部分単語を考慮しない単語内に現れる文字の組み合わせも破棄する。
【００１４】
このように、ファシリティは、辞書内で調べる文字の組み合わせ数を最少に抑え、かつ文章の構文的文脈を利用して各々有効な単語から構成された区分選択肢の結果間で差別化する。
【００１５】
（発明の詳細な説明）
本発明は、中国語文において単語区分を行なう。好適な実施形態では、単語区分ソフトウエア・ファシリティ（“ファシリティ”）が、中国語のような非区分言語における文の単語区分を行なう際に、（１）入力文章における文字の可能な組み合わせを評価して、入力文章内の単語を表す可能性がないものを破棄し、（２）辞書において残りの文字の組み合わせを調べ、これらが単語を構成できるか否か判定し、（３）単語であると判定した文字の組み合わせを、入力文章を表す代替語彙レコードとして、自然言語パーザに提出する。パーザは、入力センテンスの構文構造を表す構文解析ツリーを生成する。これは、入力文章における単語であることが証明された文字の組み合わせを表す語彙レコードのみを含む。語彙レコードをパーザに提出する際、ファシリティは、語彙レコードに重み付けを行い、パーザが長い文字の組み合わせを、短い文字の組み合わせよりも前に検討するようにする。何故なら、一般に、長い文字の組み合わせの方が、短い文字の組み合わせよりも文章の正しい区分を表す場合が多いからである。
【００１６】
入力文章における単語を表す可能性が低い文字の組み合わせを容易に破棄するために、ファシリティは、辞書内で現れる文字毎に、（１）単語長および単語が現れる文字位置の異なる組み合わせ全ての指示、および（２）この文字が単語の先頭にあるときに、この文字の後に続く可能性がある文字全ての指示を、辞書に追加する。更に、ファシリティは、（３）多文字単語に対して、多文字単語内に部分単語が存在可能であり検討すべきか否かについての指示も追加する。文章を処理する際、ファシリティは、（１）いずれかの単語が辞書内にない単語長／位置の組み合わせで用いられている文字の組み合わせ、および（２）２番目の文字が最初の文字に可能な第２文字としてリストされていない文字の組み合わせを破棄する。更に、ファシリティは、（３）部分単語を考慮しない単語内に現れた文字の組み合わせも破棄する。
【００１７】
このように、ファシリティは、辞書内で調べる文字の組み合わせ数を最少に抑え、かつ文章の構文的文脈を利用して、各々有効な単語から構成された代替区分の結果間で差別化する。
【００１８】
図１は、ファシリティが実行するのが好ましい汎用コンピュータ・システムの上位ブロック図である。コンピュータ・システム１００は、中央演算装置（ＣＰＵ）１１０、入出力デバイス１２０、およびコンピュータ・メモリ（メモリ）１３０を内蔵する。入出力装置間には、ハード・ディスク・ドライブのような記憶装置１２１、ＣＤ−ＲＯＭのようなコンピュータ読取可能媒体上で供給され、ファシリティを含むソフトウエア製品をインストールするために使用可能なコンピュータ読取可能媒体ドライブ１２２、およびコンピュータ１００が他の接続してあるコンピュータ・システム（図示せず）と通信可能なネットワーク接続部１２３がある。メモリ１３０は、中国語文内に現れる個々の単語を識別する単語区分ファシリティ１３１、自然言語文内に現れる単語を表す語彙レコードから、自然言語文の文章の構文構造を表す解析ツリーを生成する構文パーザ１３３、およびパーザが用いて解析ツリーのための語彙レコードを構築し、ファシリティが用いて自然言語文内に現れる単語を識別する語彙知識ベース１３２を含むことが好ましい。ファシリティは、前述のように構成したコンピュータ・システム上で実現することが好ましいが、異なる構成を有するコンピュータ・システム上でも実現可能であることを当業者は認めよう。
【００１９】
図２は、ファシリティが動作することが好ましい２つのフェーズを示す概略フロー図である。ステップ２０１において、初期化フェーズの一部として、ファシリティは語彙知識ベースを増強し、ファシリティが単語区分を実行する際に用いる情報を含ませる。ステップ２０１については、図３と関連付けて以下で更に詳しく論ずる。端的に言うと、ステップ２０１では、ファシリティは語彙知識ベース内のいずれかの単語内に現れた文字について、語彙知識ベースにエントリを追加する。文字毎に追加するエントリは、文字が単語内に現れる異なる位置を表すＣｈａｒＰｏｓ属性を含む。更に、文字毎のエントリは、現文字で始まる単語の２番目の位置において現れる文字の集合を示すＮｅｘｔＣｈａｒｓ属性も含む。最後に、ファシリティは、語彙知識ベース内に現れる各単語にＩｇｎｏｒｅＰａｒｔｓ属性も追加する。これは、当該単語を構成する文字列も、現単語を共に構成する、より小さな単語を構成すると考えるべきか否かを示す。
【００２０】
ステップ２０１の後、ファシリティはステップ２０２に進み、初期化フェーズを終了し、単語区分フェーズを開始する。単語区分フェーズでは、ファシリティは、語彙知識ベースに追加した情報を用いて、中国語文の文章の単語区分を実行する。ステップ２０２において、ファシリティは、単語区分のために、中国語文の文章を受け取る。ステップ２０３において、ファシリティは、受け取った文章をその構成単語に区分する。ステップ２０３については、図５と関連付けて以下で更に詳しく論ずる。端的に言えば、ファシリティは、語彙知識ベースにおいて、文章内の文字の可能な連続する組み合わせ全ての小さな断片を調べる。次いで、ファシリティは、語彙知識ベースによって単語であることが示された文字の調査済みの組み合わせを、構文パーザに提出する。パーザは、文章の構文構造を判定する際に、文章の著者が当該文章において単語を構成しようと意図した文字の組み合わせを識別する。ステップ２０３の後、ファシリティはステップ２０２に進み、単語区分のために次の文章を受け取る。
【００２１】
図３は、初期化フェーズにおいて語彙知識ベースを増強し、単語区分を実行する際に用いる情報を含ませるために、ファシリティが実行することが好ましいステップを示すフロー図である。これらのステップは、（ａ）語彙知識ベース内の単語に現れる文字について、語彙知識ベースにエントリを追加し、（ｂ）語彙知識ベース内にあるこの文字のエントリにＣｈａｒＰｏｓおよびＮｅｘｔＣｈａｒｓ属性を追加し、（ｃ）語彙知識ベース内の単語に対するエントリに、ＩｇｎｏｒｅＰａｒｔｓ属性を追加する。
【００２２】
ファシリティは、語彙知識ベース内の単語エントリ毎に、ステップ３０１〜３１２のループを繰り返す。ステップ３０２において、ファシリティは単語内の文字位置毎にループを繰り返す。即ち、３つの文字を含む単語について、ファシリティは、当該単語の第１、第２および第３文字に対してループを実行する。ステップ３０３において、現文字位置にある文字が語彙知識ベース内にエントリを有する場合、ファシリティはステップ３０５に進み、それ以外の場合、ファシリティはステップ３０４に進む。ステップ３０４において、ファシリティは現文字のエントリを語彙知識ベースに追加する。ステップ３０４の後、ファシリティはステップ３０５に進む。ステップ３０５において、ファシリティは、語彙知識ベース内の文字のエントリに格納してあるＣｈａｒＰｏｓ属性に、整列対（ｏｒｄｅｒｅｄｐａｉｒ）を追加し、その文字は、現単語において現れた位置において現れる可能性があることを示す。追加する整列対は、（ｐｏｓｉｔｉｏｎ、ｌｅｎｇｔｈ）という形態を有し、ｐｏｓｉｔｉｏｎとは当該文字が単語内で占める位置であり、ｌｅｎｇｔｈは単語内の文字数である。例えば、

【００２３】
という単語における文字“委”について、ファシリティは、整列対（１，３）を、文字“委”に対する語彙知識ベース・エントリ内のＣｈａｒＰｏｓ属性に格納されている整列対のリストに追加する。好ましくは、ファシリティは、整列対が既に現単語に対するＣｈａｒＰｏｓ属性に既に含まれている場合、ステップ３０５で説明したように、整列対を追加しない。ステップ３０６において、処理する現単語に未だ文字が残っている場合、ファシリティはステップ３０２に進み次の文字を処理する。それ以外の場合、ファシリティはステップ３０７に進む。
ステップ３０７において、単語が単一文字単語である場合、ファシリティはステップ３０９に進み、それ以外の場合ファシリティはステップ３０８に進む。ステップ３０８において、ファシリティは、現単語の２番目の位置にある文字を、現単語の第１位置にある文字の語彙知識ベース・レコード内にあるＮｅｘｔＣｈａｒｓ属性内の文字リストに追加する。例えば、

【００２４】
という単語では、ファシリティは、文字

【００２５】
を文字“委”のＮｅｘｔＣｈａｒｓ属性に対して格納してある文字リストに追加する。ステップ３０８の後、ファシリティはステップ３０９に進む。
ステップ３０９において、現単語が他の更に小さい単語を含むことができる場合、ファシリティはステップ３１１に進み、それ以外の場合ファシリティはステップ３１０に進む。ステップ３０９については、図４と関連付けて以下で更に詳しく論ずる。端的に言えば、ファシリティは多数の発見的方法を用いて、現単語を構成する文字列がある場合、ある文脈では、この文字列が２つ以上のより小さな単語を構成する可能性があるか否かについて判定を行なう。
【００２６】
ステップ３１０において、ファシリティは、前述の単語に対する語彙知識ベース・エントリにおいて、単語のＩｇｎｏｒｅＰａｒｔｓ属性をセット（set）する。ＩｇｎｏｒｅＰａｒｔｓ属性をセットすると、ファシリティがこの単語を入力文の文章内において発見した場合、この単語がより小さい単語を含むか否かについて判定するこれ以上のステップを実行しないことを意味する。ステップ３１０の後、ファシリティはステップ３１２に進む。ステップ３１１において、現単語は他の単語を含む可能性があるので、ファシリティは当該単語に対するＩｇｎｏｒｅＰａｒｔｓ属性をクリア（clear）し、入力文の文章内でその単語を発見した場合、ファシリティは当該単語がより小さな単語を含むか否かについての調査に進むようにする。ステップ３１１の後、ファシリティはステップ３１２に進む。ステップ３１２において、処理する語彙知識ベースに未だ単語が残っている場合、ファシリティはステップ３０１に進み、次の単語を処理する。それ以外の場合、これらのステップは終了する。
【００２７】
ファシリティが図３に示すステップを実行し、ＣｈａｒＰｏｓおよびＮｅｘｔＣｈａｒｓ属性を各文字に割り当てることによって、語彙知識ベースを増強する際、以下の表５に示すように、表３に示したサンプル文章内に現れた文字（Character）に対して、これらの属性を割り当てる。
【００２８】
表５：文字語彙知識ベース・エントリ
【表５】

【００２９】
表５から、例えば、文字“昨”のＣｈａｒＰｏｓ属性から、この文字は２、３または４文字長の単語の最初の文字として現れる可能性があることがわかる。更に、文字“昨”のＮｅｘｔＣｈａｒｓ属性から、この文字で始まる単語では、２番目の文字は、“儿”、“天”または“晩”のいずれかの可能性があることもわかる。
【００３０】
図４は、特定の単語が、他の更に小さい単語を含む可能性がる否かについて判定するために実行することが好ましいステップを示すフロー図である。英語との類似性として、英文からスペースおよび句読点記号を除去した場合、“ｂｅａｔ”という文字列は、単語“ｂｅａｔ”または２つの単語“ｂｅ”および“ａｔ”のいずれかとして解釈することが可能である。ステップ４０１において、単語が４つ以上の文字を含む場合、ファシリティはステップ４０２に進み、この単語は他の単語を含む可能性がないという結果を返す。それ以外の場合、ファシリティはステップ４０３に進む。ステップ４０３において、単語内の全ての文字が単一文字単語を構成することができる場合、ファシリティはステップ４０５に進み、それ以外の場合、ファシリティはステップ４０４に進み、単語は他の単語を含む可能性がないという結果を返す。ステップ４０５において、単語は派生接辞、即ち、接頭辞または接尾辞として頻繁に用いられる単語を含む場合、ファシリティはステップ４０６に進み、単語は他の単語を含む可能性がないという結果を返す。それ以外の場合、ファシリティはステップ４０７に進む。ステップ４０７において、単語内の隣接する文字対が言語の文中で隣接して現れる際分割されることが多い場合、ファシリティはステップ４０９に進み、単語は他の単語を含む可能性があるという結果を返す。それ以外の場合、ファシリティはステップ４０８に進み、単語は他の単語を含む可能性がないという結果を返す。
【００３１】
個々の単語が他のより小さい単語を含む可能性があるか否かについての判定結果を以下の表６に示す。
【００３２】
表６：単語語彙知識ベース・エントリ
【表６】

【００３３】
例えば、表６から、単語（Word）“昨天”は他の単語を含む可能性はなく、一方“天下”は他の単語を含む可能性があるとファシリティが判定したことがわかる。
【００３４】
図５は、文章をその構成単語に区分するためにファシリティが実行することが好ましいステップのフロー図である。これらのステップは、文章内で現れる言語の異なる単語を識別する単語リストを生成し、この単語リストをパーザに提出し、著者が文章を構成しようとした、単語リスト内の単語の部分集合を識別する。
【００３５】
ステップ５０１において、ファシリティは、文章内に現れた多文字単語を単語リストに追加する。ステップ５０１については、図６と関連付けて以下で更に詳しく論ずる。ステップ５０２において、ファシリティは、文章内に現れた単一文字単語を、単語リストに追加する。ステップ５０２については、図９と関連付けて以下で更に詳しく説明する。ステップ５０３において、ファシリティは、語彙レコードを生成する。これは、語彙パーザが、ステップ５０１および５０２において単語リストに追加した単語のために用いる。ステップ５０４において、ファシリティは、語彙レコードに確率を割り当てる。語彙レコードの確率は、当該語彙レコードが文章の正しい解析ツリーの一部である確度を反映し、パーザが解析プロセスにおいて語彙レコードの適用を命令するために用いる。パーザは、解析プロセス中、語彙レコードをその確率が小さくなる順に適用する。ステップ５０４については、図１０と関連付けて以下で更に詳しく論ずる。ステップ５０５において、ファシリティは構文パーザを利用して、語彙レコードを解析し、文章の構文構造を反映する解析ツリーを生成する。この解析ツリーは、ステップ５０３において生成した語彙レコードの部分集合を、そのリーフとして有する。ステップ５０６において、ファシリティは、解析ツリーのリーフである語彙レコードが表す単語を、文章の単語として識別する。ステップ５０６の後、これらのステップは終了する。
【００３６】
図６は、多文字単語を単語リストに追加するためにファシリティが実行することが好ましいステップを示すフロー図である。これらのステップは、文章内の現位置を用いて文章を分析し、多文字単語を識別する。更に、これらのステップは、図４に示したように、ファシリティが語彙知識ベースに追加したＣｈａｒＰｏｓ、ＮｅｘｔＣｈａｒおよびＩｇｎｏｒｅＰａｒｔｓ属性を利用する。第１の好適な実施形態によれば、ファシリティは、図６に示すステップの実行中、必要に応じて、語彙知識ベースからこれらの属性を検索する。第２の好適な実施形態では、文章内の文字のＮｅｘｔＣｈａｒ属性および／またはＣｈａｒＰｏｓ属性の値は、全て、図６に示すステップの実行前に、予めロードしてある。第２の好適な実施形態に関連して、文章内に現れた文字毎に、ＣｈａｒＰｏｓ属性の値を含む三次元アレイをメモリに格納することが好ましい。このアレイは、文章内の所与の位置における文字について、当該文字が所与の長さの単語において所与の位置にある可能性があるか否かについて示すものである。これらの属性の値をキャッシュすることによって、図６に示すステップを実行する際に、これらに正式にアクセスすることが可能となる。
【００３７】
ステップ６０１において、ファシリティは、文章の最初の文字にこの位置をセットする。ステップ６０２ないし６１４において、ファシリティは、位置が文章の終端まで進み終えるまで、ステップ６０３ないし６１３を繰り返し続ける。
【００３８】
ステップ６０３ないし６０９において、ファシリティは、現位置から開始する単語候補毎にループを繰り返す。ファシリティは、現位置から開始し７文字長である単語候補から開始し、繰り返し毎に、単語候補の終端から１つの文字を除去し、単語候補が２文字長になるまで続ける。現位置から開始する文章内に残っている文字が７つ未満の場合、ファシリティは、文章内に十分な文字が残っていない単語候補に対する繰り返しを省略することが好ましい。ステップ６０４において、ファシリティは、単語候補を構成する文字のＮｅｘｔＣｈａｒおよびＣｈａｒＰｏｓ属性に関係する現単語候補の条件を検査する。ステップ６０４については、図７と関連付けて以下で更に詳しく論ずる。ＮｅｘｔＣｈａｒおよびＣｈａｒＰｏｓ条件双方が単語候補に対して満たされる場合、ファシリティはステップ６０５に進み、それ以外の場合、ファシリティはステップ６０９に進む。ステップ６０５において、ファシリティは、語彙知識ベース内で単語候補を調べ、当該単語候補が単語であるか否かについて判定を行なう。ステップ６０６において、単語候補が単語である場合、ファシリティはステップ６０７に進み、それ以外の場合、ファシリティはステップ６０９に進む。ステップ６０７において、ファシリティは、文章内に現れた単語のリストに、この単語候補を追加する。ステップ６０８において、単語候補が他の単語を含む可能性がある場合、即ち、この単語のＩｇｎｏｒｅＰａｒｔｓ属性がクリア（clear）の場合、ファシリティはステップ６０９に進み、それ以外の場合、ファシリティはステップ６１１に進む。ステップ６０９において、処理すべき単語候補が未だ残っている場合、ファシリティはステップ６０３に進み、次の単語候補を処理する。それ以外の場合、ファシリティはステップ６１０に進む。ステップ６１０において、ファシリティは、文章の終端に向かって現位置を１文字だけ進ませる。ステップ６１０の後、ファシリティはステップ６１４に進む。
【００３９】
ステップ６１１において、単語候補の最後の文字が、同様に単語であり得る他の単語候補と重複する場合、ファシリティはステップ６１３に進み、それ以外の場合、ファシリティはステップ６１２に進む。ステップ６１１については、図８と関連付けて以下で更に詳しく論ずる。ステップ６１２において、ファシリティは、文章内の単語候補の最後の文字の後ろにある文字に位置を進ませる。ステップ６１２の後、ファシリティはステップ６１４に進む。ステップ６１３において、ファシリティは、現単語候補の最後の文字に位置を進ませる。ステップ６１３の後、ファシリティはステップ６１４に進む。ステップ６１４において、位置が文章の終端でない場合、ファシリティはステップ６０２に進み、新たな単語候補群を検討する。それ以外の場合、これらのステップは終了する。
【００４０】
図７は、単語候補に対してＮｅｘｔＣｈａｒおよびＣｈａｒＰｏｓ条件を検査するためにファシリティが実行することが好ましいステップを示すフロー図である。ステップ７０１において、単語候補の２番目の文字が、単語候補の最初の文字のＮｅｘｔＣｈａｒリスト内にある場合、ファシリティはステップ７０３に進む。それ以外の場合、ファシリティはステップ７０２に進み、条件を双方共満足したという結果を返す。ステップ７０３ないし７０６において、ファシリティは、単語候補内の文字位置毎にループを繰り返す。ステップ７０４において、単語候補の現位置および長さで構成した整列対が、現文字位置における文字に対するＣｈａｒＰｏｓリスト内の整列対の中にある場合、ファシリティはステップ７０６に進み、それ以外の場合、ファシリティはステップ７０５に進み、双方の条件を満たしてはいないという結果を返す。ステップ７０６において、単語候補内に処理すべき文字位置が未だ残っている場合、ファシリティはステップ７０３に進み、単語候補内の次の文字位置を処理する。それ以外の場合、ファシリティはステップ７０７に進み、単語候補が双方の条件を満足したという結果を返す。
【００４１】
図８は、現単語候補の最後の文字が、単語であり得る別の単語候補と重複するか否かについて判定を行なうためにファシリティが実行することが好ましいステップを示すフロー図である。ステップ８０１において、単語候補の後ろにある文字が、当該単語候補の最後の文字に対するＮｅｘｔＣｈａｒ属性における文字リスト内にある場合、ファシリティはステップ８０３に進む。それ以外の場合、ファシリティはステップ８０２に進み、重複はないという結果を返す。ステップ８０３において、ファシリティは、語彙知識ベースにおいて、単語候補を、その最後の文字を除いて調べ、最後の文字を除いた単語候補が単語となるか否かについて判定を行なう。ステップ８０４において、最後の文字を除いた単語候補が単語になる場合、ファシリティはステップ８０６に進み、重複があるという結果を返す。それ以外の場合、ファシリティはステップ８０５に進み、重複がないという結果を返す。
【００４２】
前述の例に関する、図６に示したステップの実行を、以下の表７に示す。
【００４３】
表７：検討した文字の組み合わせ
【表７】

【００４４】
表７は、ファシリティが検討したサンプル文章からの文字の５３通りの組み合わせ（combination）の各々について、ＣｈａｒＰｏｓ検査の結果、ＮｅｘｔＣｈａｒｓ検査の結果、ファシリティが語彙知識ベース内で当該単語を調べたか否か（look up?）、そして語彙知識ベースが、文字の組み合わせが単語になることを示したか否か（is a word?）を示すものである。
【００４５】
組み合わせ１ないし４は、ＣｈａｒＰｏｓ検査で不合格（fail）であったことがわかる（fail on 昨）。何故なら、文字“昨”のＣｈａｒＰｏｓ属性は、整列対（１，７）、（１，６）、（１，５）または（１，４）を含まないからである。一方、組み合わせ５および６では、ＣｈａｒＰｏｓ検査およびＮｅｘｔＣｈａｒｓ検査双方共、合格（pass）である。したがって、ファシリティは、組み合わせ５および６を語彙知識ベース内で調べ、組み合わせ５は単語ではないが、組み合わせ６は単語であると判定する。組み合わせ６を処理し、現在位置からどれだけ進ませるか決定した後、ファシリティは、ＩｇｎｏｒｅＰａｒｔｓ属性がセット（set）されているが、単語“昨天”は文字“天”で始まるある単語候補と重複することを判定する。したがって、ファシリティは、ステップ６１３にしたがって、組み合わせ６の終端にある文字“天”まで進む。組み合わせ７〜１２では、組み合わせ１２のみがＣｈａｒＰｏｓ検査およびＮｅｘｔＣｈａｒｓ検査双方に合格している。したがって、組み合わせ１２を調べ、単語であると判定する。組み合わせ１２を処理し、現在位置をどれだけ進ませるか決定した後、ファシリティは、組み合わせ１２が構成する単語のＩｎｇｏｒｅＰａｒｔｓ属性がクリア（clear）であることを判定し、したがって、現位置を、組み合わせ１２に続く文字ではなく、文字“下”まで１文字進ませる。
【００４６】
更に、組み合わせ１８、２４、３７および４３は、ＩｇｎｏｒｅＰａｒｔｓ属性がセットされ、単語であり得るいずれの単語候補ともそれらの最後の文字が重複しない単語であることもわかる。したがって、各々を処理した後、ファシリティは、ステップ６１２にしたがって、当該文字の組み合わせに続く文字まで現位置を進ませることによって、これらの４つの組み合わせの各々に対して、不必要に４１個の余分な組み合わせまで処理することを省略する。
【００４７】
更に、組み合わせ２３および５０が構成する単語のＩｇｎｏｒｅＰａｒｔｓ属性はクリアであることもわかる。このため、ファシリティは、これらの組み合わせを処理した後、ステップ６１０にしたがって、１文字だけ現位置を進ませる。
【００４８】
更に、２文字の組み合わせ３０、３６、４７および５２は、ファシリティが単語を構成するとは判定しなかったこともわかる。したがって、ファシリティは、ステップ６１０にしたがって、これらの組み合わせを処理した後、１文字だけ現位置を進ませる。結局、ファシリティは、サンプル文章において可能な１１２個の組み合わせの内、わずか１４個のみを調べたに過ぎない。ファシリティが調べた１４個の組み合わせの内、９つは実際の単語である。
【００４９】
表８に示すように、表７と関連付けて説明した処理の後、単語リストは、組み合わせ６、１２、１８、２３，２４，３７、４３、５０および５３で構成された単語を含む（名詞（noun）、動詞（verb）、代名詞（pronoun）の品詞（part of speech）も示す）。
【００５０】
表８：多文字単語の単語リスト
【表８】

【００５１】
図９は、単一文字単語を単語リストに追加するためにファシリティが実行することが好ましいステップを示すフロー図である。ステップ９０１ないし９０６において、ファシリティは、文章における文字毎に、最初の文字から最後の文字まで、ループを繰り返す。ステップ９０２において、ファシリティは、語彙知識ベース内にあるそのエントリに基づいて、当該文字が単一文字単語を構成するか否かについて判定を行い、構成しない場合、ファシリティは、単語リストに文字を追加せずに、ステップ９０６に進む。文字が単一文字単語を構成する場合、ファシリティはステップ９０３に進み、それ以外の場合、ファシリティはステップ９０６に進み、単語リストに文字を追加しない。ステップ９０３において、他の単語を含む可能性がない単語、即ち、既に単語リスト上にありそのＩｇｎｏｒｅＰａｒｔｓ属性がセットされている単語にこの文字が含まれる場合、ファシリティはステップ９０４に進み、それ以外の場合、ファシリティはステップ９０５に進み、この文字を単語リストに追加する。ステップ９０４において、この文字が、単語リスト上の別の単語と重複する、単語リスト上の別の単語内に含まれている場合、ファシリティはステップ９０６に進み、この文字を単語リストに追加しない。それ以外の場合、ファシリティはステップ９０５に進む。ステップ９０５において、ファシリティは、現文字を構成する単一文字単語を単語リストに追加する。ステップ９０６において、文章内に未だ処理すべき文字が残っている場合、ファシリティはステップ９０１に進み、文章内の次の文字を処理する。それ以外の場合、これらのステップは終了する。
【００５２】
以下の表９は、図９に示すステップを実行する際に、ファシリティが単語リストに追加した単一文字単語５４〜６１を示す（名詞（noun）、形態素（morpheme）、名詞（場所限定語）（noun(localizer）、動詞（verb）、前置詞（preposition）、副詞（adverb）、機能語（function word）、代名詞（pronoun）、名詞（分類辞）（noun(classifier)）の品詞（part of speech）も示す）。
【００５３】
表９：単一および多文字単語の単語リスト
【表９】

【００５４】
多文字単語および単一文字単語を単語リストに追加し、これらの単語に対する語彙レコードを生成した後、ファシリティは、語彙レコードに確率を割り当てる。これは、パーザが、解析プロセスにおいて語彙レコードの適用を順序付ける際に用いる。以下で論ずる図１０および図１１は、ファシリティが語彙レコードに確率を割り当てるために用いる２つの代替手法を示す。
【００５５】
図１０は、第１手法にしたがって単語リスト内の単語から生成した語彙レコードに確率を割り当てるためにファシリティが実行することが好ましいステップを示すフロー図である。ファシリティは、究極的には、語彙レコード毎の確率を、パーザに解析プロセス中早期に語彙レコードを検討させる高い確率値、またはパーザに解析プロセス中後期に語彙レコードを検討させる低い確率値のいずれかにセットすることが好ましい。ステップ１００１ないし１００５において、ファシリティは、単語リスト内における単語毎にループを繰り返す。ステップ１００２において、現単語が単語リスト内にあるより大きな単語に含まれる場合、ファシリティはステップ１００４に進み、それ以外の場合、ファシリティはステップ１００３に進む。ステップ１００３において、ファシリティは、この単語を表す語彙レコードの確率を、高い確率値にセットする。ステップ１００３の後、ファシリティはステップ１００５に進む。ステップ１００４において、ファシリティは、その単語を表す語彙レコードの確率を、低い確率値にセットする。ステップ１００４の後、ファシリティはステップ１００５に進む。ステップ１００５において、単語リスト内に未だ処理すべき単語が残っている場合、ファシリティはステップ１００１に進み、単語リスト内にある次の単語を処理する。それ以外の場合、これらのステップは終了する。
【００５６】
以下の表１０は、図１０に示すステップにしたがって、単語リスト内の各単語に割り当てた確率値（probability value）を示す。確率を調べることにより、ファシリティは各文字を含む少なくとも１つの単語に高い（high）確率値を割り当てており、各文字を含む少なくとも１つの語彙レコードを解析プロセスの早期において検討するようにしていることがわかる。
【００５７】
表１０：確率を加えた単語リスト
【表１０】

【００５８】
図１１は、第２手法にしたがって単語リスト内の単語から発生した語彙レコードに確率を割り当てるためにファシリティが実行することが好ましいステップを示すフロー図である。ステップ１１０１において、ファシリティは、単語リストを用いて、単語リスト内の単語で全体的に構成されている文章に可能な全ての区分を特定する。ステップ１１０２において、ファシリティは、ステップ１１０１において特定した、可能な区分の内、含む単語数が最も少ない１つ以上の区分を選択する。最少数の単語を有する可能な区分が１つよりも多い場合、ファシリティは、このような可能な区分の各々を選択する。
【００５９】
以下の表１１は、表９に示した単語リストから生成した、最も少ない単語（９個）を有する、可能な区分を示す。
【００６０】
【表１１】

【００６１】
ステップ１１０３において、ファシリティは、選択した区分（複数）における単語の語彙レコードの確率を高い確率値にセットする。ステップ１１０４において、ファシリティは、選択した区分（複数）にない単語の語彙レコードの確率を低い確率値にセットする。ステップ１１０４の後、これらのステップは終了する。
【００６２】
以下の表１２は、図１１に示すステップにしたがって、単語リスト内にある各単語に割り当てた確率値（probability value）を示す。確率を調べることにより、ファシリティは各文字を含む少なくとも１つの単語に高い（high）確率値を割り当てており、各文字を含む少なくとも１つの語彙レコードを解析プロセスの早期において検討するようにしていることがわかる。
【００６３】
表１２：確率を加えた単語リスト
【表１２】

【００６４】
図１２は、パーザが生成した、サンプル文章の構文構造を表す解析ツリーを示す解析ツリー図である。解析ツリーは、単一の文章レコード１２３１をその頭部として有し、かつ多数の語彙レコード１２０１〜１２１１をそのリーフとして有する階層構造であることがわかる。更に、解析ツリーは、単語を表す各語彙レコードを組み合わせて、１つ以上の単語を表すより大きな構文構造にする、中間構文レコード１２２１〜１２２７も有する。例えば、前置詞句レコード１２２３は、前置詞（preposition）を表す語彙レコード１２０４および名詞（noun）を表す語彙レコード１２０６を組み合わせる。図５のステップ５０６にしたがって、ファシリティは、解析ツリー内にある語彙レコード１２０１〜１２１１が表す単語を、サンプル文章を区分すべき単語として特定する。ファシリティがこの解析ツリーを保有して、文章に対して更に別の自然言語処理を実行するようにしてもよい。
【００６５】
以上、好適な実施形態を参照しながら本発明について示しかつ説明したが、本発明の範囲から逸脱することなく、形態および詳細において種々の変化または変更が可能であることは、当業者には認められよう。例えば、中国語以外の言語においても、単語の区分を行なうために、前述のファシリティの特徴（ａｓｐｅｃｔ）を適用することができる。更に、ここに記載した技術の適当な部分集合または上位集合も、単語の区分を実行するために適用することができる。
【図面の簡単な説明】
【図１】ファシリティが実行することが好ましい汎用コンピュータ・システムの上位ブロック図である。
【図２】ファシリティが動作することが好ましい２つのフェーズを示す概略フロー図である。
【図３】初期化フェーズにおいて語彙知識ベースを増強し、単語の区分を実行する際に用いる情報を含ませるために、ファシリティが実行することが好ましいステップを示すフロー図である。
【図４】特定の単語が、他の更に小さい単語を含む可能性があるか否かについて判定するために実行することが好ましいステップを示すフロー図である。
【図５】文章をその構成単語に区分するためにファシリティが実行することが好ましいステップのフロー図である。
【図６】多文字単語を単語リストに追加するためにファシリティが実行することが好ましいステップを示すフロー図である。
【図７】単語候補に対してＮｅｘｔＣｈａｒおよびＣｈａｒＰｏｓ条件を検査するためにファシリティが実行することが好ましいステップを示すフロー図である。
【図８】現単語候補の最後の文字が、単語であり得る別の単語候補と重複するか否かについて判定を行なうためにファシリティが実行することが好ましいステップを示すフロー図である。
【図９】単一文字単語を単語リストに追加するためにファシリティが実行することが好ましいステップを示すフロー図である。
【図１０】第１手法にしたがって単語リスト内の単語から発生した語彙レコードに確率を割り当てるためにファシリティが実行することが好ましいステップを示すフロー図である。
【図１１】第２手法にしたがって単語リスト内の単語から発生した語彙レコードに確率を割り当てるためにファシリティが実行することが好ましいステップを示すフロー図である。
【図１２】サンプル文章の構文構造を表す、パーザが生成する解析ツリーを示す解析ツリー図である。

[0001]
(Technical field)
The present invention relates generally to the field of natural language processing, and more specifically to the field of word segmentation.
[0002]
(Background of the Invention)
Word segmentation is the process of identifying the individual words that make up an expression of language, such as a sentence. Word segmentation is useful for checking spelling and grammar, synthesizing speech from text, and performing natural language parsing and understanding, all of which benefit from the identification of individual words.
[0003]
For English text, word segmentation is relatively straightforward, because spaces and punctuation marks delimit the individual words in a sentence. Consider the English sentence in Table 1 below.
[0004]
[Table 1]

[0005]
By identifying each contiguous run of spaces and/or punctuation marks as the end of the word preceding it, the English sentence in Table 1 can be segmented straightforwardly, as shown in Table 2 below.
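The delimiter-based approach described for English can be illustrated with a minimal Python sketch. The function name `segment_english` and the choice of which characters count as part of a word (ASCII letters, digits, and apostrophes) are assumptions made for this illustration; the source does not specify them.

```python
import re


def segment_english(sentence: str) -> list[str]:
    """Return the words of an English sentence.

    Every run of characters that is not a letter, digit, or apostrophe
    (spaces, punctuation) is treated as the end of the preceding word.
    The character class is an assumption made for this sketch.
    """
    return re.findall(r"[A-Za-z0-9']+", sentence)
```

For instance, `segment_english("The committee met yesterday, in Buenos Aires.")` yields the seven words of the sentence with the comma and period dropped. No comparably simple rule exists for Chinese text, which is the motivation for the remainder of the description.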
[0006]
[Table 2]

[0007]
In Chinese sentences, word boundaries are implicit rather than explicit. Consider the sentence in Table 3 below, which means “The committee discussed this matter yesterday afternoon in Buenos Aires.”
[0008]
[Table 3]

[0009]
Even though the sentence contains no punctuation marks or spaces, a reader of Chinese would recognize the sentence in Table 3 as being composed of the words shown separately underlined in Table 4 below.
[0010]
[Table 4]

[0011]
The above example shows that Chinese word segmentation cannot be performed in the same way as English word segmentation. An accurate and efficient method for automatically segmenting Chinese text would therefore have significant utility.
[0012]
(Summary of Invention)
According to the present invention, a word segmentation software facility (“the facility”) performs word segmentation of sentences in an unsegmented language such as Chinese by (1) evaluating the possible combinations of characters in an input sentence and discarding those that are unlikely to represent words in the input sentence, (2) looking up the remaining combinations of characters in a dictionary to determine whether they can constitute words, and (3) submitting the combinations of characters determined to be words to a natural language parser as alternative lexical records representing the input sentence. The parser generates a parse tree representing the syntactic structure of the input sentence; the tree contains only those lexical records representing combinations of characters confirmed to be words in the input sentence. When submitting lexical records to the parser, the facility weights them so that the parser considers longer combinations of characters before shorter ones, because longer combinations generally represent the correct segmentation of a sentence more often than shorter ones.
[0013]
To make it easy to discard combinations of characters that are unlikely to represent words in the input sentence, the facility adds to the dictionary, for each character occurring in the dictionary, (1) indications of all the different combinations of word length and character position in which the character appears in words, and (2) indications of all the characters that can follow the character when it begins a word. The facility further adds (3) indications, for multiple-character words, of whether sub-words within the multiple-character word are viable and should be considered. In processing a sentence, the facility discards (1) combinations of characters in which any character is used in a word length/position combination not found in the dictionary, and (2) combinations of characters in which the second character is not listed as a possible second character for the first character. The facility further discards (3) combinations of characters occurring within a word for which sub-words are not to be considered.
[0014]
In this way, the facility both minimizes the number of character combinations looked up in the dictionary and uses the syntactic context of the sentence to differentiate between alternative segmentation results that are each composed of valid words.
[0015]
(Detailed description of the invention)
The present invention performs word segmentation of Chinese text. In a preferred embodiment, a word segmentation software facility (“the facility”) performs word segmentation of sentences in an unsegmented language such as Chinese by (1) evaluating the possible combinations of characters in an input sentence and discarding those that are unlikely to represent words in the input sentence, (2) looking up the remaining combinations of characters in a dictionary to determine whether they can constitute words, and (3) submitting the combinations of characters determined to be words to a natural language parser as alternative lexical records representing the input sentence. The parser generates a parse tree representing the syntactic structure of the input sentence; the tree contains only those lexical records representing combinations of characters confirmed to be words in the input sentence. When submitting lexical records to the parser, the facility weights them so that the parser considers longer combinations of characters before shorter ones, because longer combinations generally represent the correct segmentation of a sentence more often than shorter ones.
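The idea of having the parser consider longer character combinations first can be sketched as a simple ordering step. This is a deliberately simplified stand-in: the `LexicalRecord` type, its fields, and the use of a plain length sort are assumptions for illustration, and the paragraph above says only that the records are weighted, not how.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalRecord:
    """Hypothetical lexical record: a word and its start offset in the sentence."""
    text: str
    start: int


def order_for_parser(records: list[LexicalRecord]) -> list[LexicalRecord]:
    """Order records so that longer character combinations come first.

    sorted() is stable, so records of equal length keep their input order.
    """
    return sorted(records, key=lambda record: len(record.text), reverse=True)
```

With candidate records for “天”, “昨天”, “下”, and “天下”, this ordering places the two-character records ahead of the single-character ones.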
[0016]
To make it easy to discard combinations of characters that are unlikely to represent words in the input sentence, the facility adds to the dictionary, for each character occurring in the dictionary, (1) indications of all the different combinations of word length and character position in which the character appears in words, and (2) indications of all the characters that can follow the character when it is at the beginning of a word. The facility further adds (3) indications, for multiple-character words, of whether sub-words within the multiple-character word are viable and should be considered. In processing a sentence, the facility discards (1) combinations of characters in which any character is used in a word length/position combination not found in the dictionary, and (2) combinations of characters in which the second character is not listed as a possible second character for the first character. The facility further discards (3) combinations of characters occurring within a word for which sub-words are not to be considered.
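Discard rules (1) and (2) above can be read as a cheap pre-filter applied before any dictionary lookup. The following sketch assumes the per-character information is held in two plain dictionaries, `char_pos` (character to a set of `(position, length)` pairs) and `next_chars` (character to a set of possible second characters). These names and the container choices are illustrative and not taken from the source.

```python
def passes_filters(candidate: str,
                   char_pos: dict[str, set[tuple[int, int]]],
                   next_chars: dict[str, set[str]]) -> bool:
    """Return True if a multi-character candidate survives both discard rules.

    Rule (2): the second character must be listed as a possible second
    character for the first character.
    Rule (1): every character must be known to occur at its (1-based)
    position in some dictionary word of the candidate's length.
    """
    length = len(candidate)
    if length < 2:
        raise ValueError("the filters apply to multi-character candidates only")
    if candidate[1] not in next_chars.get(candidate[0], set()):
        return False
    for position, character in enumerate(candidate, start=1):
        if (position, length) not in char_pos.get(character, set()):
            return False
    return True
```

Only candidates for which this function returns `True` would need to be looked up in the dictionary, which is how the number of lookups is kept small.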
[0017]
In this way, the facility both minimizes the number of character combinations looked up in the dictionary and uses the syntactic context of the sentence to differentiate between alternative segmentation results that are each composed of valid words.
[0018]
FIG. 1 is a high-level block diagram of the general-purpose computer system upon which the facility preferably executes. The computer system 100 contains a central processing unit (CPU) 110, input/output devices 120, and a computer memory (memory) 130. Among the input/output devices are a storage device 121, such as a hard disk drive; a computer-readable media drive 122, which can be used to install software products, including the facility, that are provided on a computer-readable medium such as a CD-ROM; and a network connection 123, through which the computer system 100 can communicate with other connected computer systems (not shown). The memory 130 preferably contains a word segmentation facility 131 for identifying the individual words occurring in Chinese text; a syntactic parser 133 for generating, from lexical records representing the words occurring in a natural language sentence, a parse tree representing the syntactic structure of the sentence; and a lexical knowledge base 132, which the parser uses to construct lexical records for the parse tree and the facility uses to identify words occurring in natural language text. While the facility is preferably implemented on a computer system configured as described above, those skilled in the art will recognize that it may also be implemented on computer systems having different configurations.
[0019]
FIG. 2 is an overview flow diagram showing the two phases in which the facility preferably operates. In step 201, as part of an initialization phase, the facility augments the lexical knowledge base to include information used by the facility to perform word segmentation. Step 201 is discussed in greater detail below in conjunction with FIG. 3. Briefly, in step 201 the facility adds entries to the lexical knowledge base for the characters occurring in any word in the lexical knowledge base. The entry added for each character includes a CharPos attribute that indicates the different positions at which the character appears in words. The entry for each character further includes a NextChars attribute that indicates the set of characters that occur in the second position of words beginning with the current character. Finally, the facility adds an IgnoreParts attribute to each word occurring in the lexical knowledge base; this attribute indicates whether the sequence of characters making up the word should also be considered as a series of smaller words that together make up the current word.
[0020]
After step 201, the facility proceeds to step 202, ending the initialization phase and beginning the word segmentation phase. In the word segmentation phase, the facility uses the information added to the lexical knowledge base to perform word segmentation of sentences of Chinese text. In step 202, the facility receives a sentence of Chinese text for word segmentation. In step 203, the facility segments the received sentence into its constituent words. Step 203 is discussed in greater detail below in conjunction with FIG. 5. Briefly, the facility looks up in the lexical knowledge base only a small fraction of all the possible contiguous combinations of characters in the sentence. The facility then submits to the syntactic parser the looked-up combinations of characters that the lexical knowledge base indicates are words. In determining the syntactic structure of the sentence, the parser identifies the combinations of characters that the author of the sentence intended to constitute words. After step 203, the facility returns to step 202 to receive the next sentence for word segmentation.
[0021]
FIG. 3 is a flow diagram showing the steps preferably performed by the facility to augment the lexical knowledge base in the initialization phase to include information used in performing word segmentation. These steps (a) add entries to the lexical knowledge base for the characters occurring in words in the lexical knowledge base, (b) add CharPos and NextChars attributes to the character entries in the lexical knowledge base, and (c) add an IgnoreParts attribute to the entries for words in the lexical knowledge base.
[0022]
The facility loops through steps 301 to 312 for each word entry in the lexical knowledge base. In step 302, the facility loops through each character position in the word. That is, for a word containing three characters, the facility performs the loop for the first, second, and third characters of the word. In step 303, if the character at the current character position has an entry in the lexical knowledge base, the facility proceeds to step 305; otherwise, the facility proceeds to step 304. In step 304, the facility adds an entry for the current character to the lexical knowledge base. After step 304, the facility proceeds to step 305. In step 305, the facility adds an ordered pair to the CharPos attribute stored in the character's entry in the lexical knowledge base, indicating that the character can occur in the position in which it occurs in the current word. The ordered pair added has the form (position, length), where position is the position that the character occupies in the word and length is the number of characters in the word. For example,

[0023]
for the character “委” in this word, the facility adds the ordered pair (1, 3) to the list of ordered pairs stored in the CharPos attribute of the lexical knowledge base entry for the character “委”. Preferably, the facility does not add the ordered pair as described in step 305 if the ordered pair is already included in that CharPos attribute. In step 306, if characters remain in the current word to be processed, the facility returns to step 302 to process the next character. Otherwise, the facility proceeds to step 307.
In step 307, if the word is a single character word, the facility proceeds to step 309, otherwise the facility proceeds to step 308. In step 308, the facility adds the character at the second position of the current word to the character list in the NextChars attribute in the lexical knowledge base record of the character at the first position of the current word. For example,

[0024]
for this word, the facility adds the character

[0025]
to the list of characters stored for the NextChars attribute of the character “委”. After step 308, the facility proceeds to step 309.
In step 309, if the current word can contain other, smaller words, the facility proceeds to step 311; otherwise, the facility proceeds to step 310. Step 309 is discussed in greater detail below in conjunction with FIG. 4. Briefly, the facility uses a number of heuristics to determine whether the sequence of characters making up the current word could, in some contexts, instead make up two or more smaller words.
[0026]
In step 310, the facility sets the IgnoreParts attribute of the word in the lexical knowledge base entry for the word. Setting the IgnoreParts attribute means that, when the facility encounters this word in a sentence of input text, it performs no further steps to determine whether the word contains smaller words. After step 310, the facility proceeds to step 312. In step 311, because the current word can contain other words, the facility clears the IgnoreParts attribute for the word, so that when the facility encounters the word in a sentence of input text, it goes on to investigate whether the word contains smaller words. After step 311, the facility proceeds to step 312. In step 312, if words remain in the lexical knowledge base to be processed, the facility returns to step 301 to process the next word. Otherwise, these steps conclude.
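The character-level part of this initialization loop (steps 301 through 308) can be sketched as follows. The function name `build_character_index` and the use of in-memory dictionaries of sets in place of lexical knowledge base entries are assumptions made for illustration. The IgnoreParts decision of steps 309 through 311 is left out here because it depends on the heuristics of FIG. 4.

```python
from collections import defaultdict


def build_character_index(words):
    """Derive CharPos and NextChars information from a list of dictionary words.

    Returns two dicts:
      char_pos[c]   -> set of (position, length) pairs, with 1-based positions
      next_chars[c] -> set of characters seen in second position after c
    Using sets means an ordered pair or character already present is not
    added a second time.
    """
    char_pos = defaultdict(set)
    next_chars = defaultdict(set)
    for word in words:
        length = len(word)
        for position, character in enumerate(word, start=1):
            # Corresponds to step 305: record where the character occurs.
            char_pos[character].add((position, length))
        if length > 1:
            # Corresponds to step 308: record the second character
            # against the first character of the word.
            next_chars[word[0]].add(word[1])
    return dict(char_pos), dict(next_chars)
```

Calling this on a small word list such as `["昨天", "天下", "天"]` records, for example, that “天” can be the second character of a two-character word, the first character of a two-character word, and a word on its own.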
[0027]
When the facility performs the steps shown in FIG. 3 to augment the lexical knowledge base by assigning CharPos and NextChars attributes to each character, it assigns these attributes to the characters occurring in the sample sentence of Table 3 as shown in Table 5 below.
[0028]
Table 5: Character vocabulary knowledge base entry
[Table 5]

[0029]
From Table 5 it can be seen, for example, from the CharPos attribute of the character “昨”, that this character can occur as the first character of words that are two, three, or four characters long. It can further be seen, from the NextChars attribute of the character “昨”, that in words beginning with this character, the second character can be “儿”, “天”, or “晩”.
[0030]
FIG. 4 is a flow diagram illustrating the steps that are preferably performed to determine whether a particular word may contain other, smaller words. An analogous situation arises in English: if spaces and punctuation marks were removed from English text, the string “beat” could be interpreted either as the word “beat” or as the two words “be” and “at”. In step 401, if the word contains more than three characters, the facility proceeds to step 402 and returns the result that the word cannot contain other words. Otherwise, the facility proceeds to step 403. In step 403, if every character in the word can constitute a single-character word, the facility proceeds to step 405; otherwise, the facility proceeds to step 404 and returns the result that the word cannot contain other words. In step 405, if the word contains a derivational affix, i.e., a character that is frequently used as a prefix or suffix, the facility proceeds to step 406 and returns the result that the word cannot contain other words. Otherwise, the facility proceeds to step 407. In step 407, if the adjacent characters of the word are often divided when they appear adjacently in text of the language, the facility proceeds to step 409 and returns the result that the word may contain other words. Otherwise, the facility proceeds to step 408 and returns the result that the word cannot contain other words.
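The four tests of FIG. 4 can be summarized in a short sketch. It assumes the single-character words, the affix characters, and the often-divided adjacent pairs are available as plain sets; these parameter names, and the choice to treat a word as divisible when any one of its adjacent pairs is often divided, are assumptions made for illustration only.

```python
def may_contain_smaller_words(word, single_char_words, affixes, often_split_pairs):
    """Return True when the word may contain other, smaller words."""
    # Step 401: words of more than three characters are treated as indivisible.
    if len(word) > 3:
        return False
    # Step 403: every character must be able to stand as a single-character word.
    if not all(ch in single_char_words for ch in word):
        return False
    # Step 405: a word containing a derivational affix is treated as indivisible.
    if any(ch in affixes for ch in word):
        return False
    # Step 407: at least one adjacent pair must often be divided in running text
    # (assumption: "any" pair suffices; the text does not say any versus all).
    return any(word[i:i + 2] in often_split_pairs for i in range(len(word) - 1))

# Hypothetical toy data; each Latin letter stands in for one character.
singles = {"a", "b", "c"}
affixes = {"c"}
split_pairs = {"ab"}
print(may_contain_smaller_words("ab", singles, affixes, split_pairs))    # True
print(may_contain_smaller_words("bc", singles, affixes, split_pairs))    # False
print(may_contain_smaller_words("abab", singles, affixes, split_pairs))  # False
```

A word for which this function returns False would have its IgnoreParts attribute set, and a word for which it returns True would have the attribute cleared.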
[0031]
Table 6 below shows the determination results as to whether an individual word may contain other smaller words.
[0032]
Table 6: Word Vocabulary Knowledge Base Entry
[Table 6]

[0033]
For example, it can be seen from Table 6 that the facility has determined that the word “昨天” (“yesterday”) cannot contain other words, while the word “天下” (“world”) may contain other words.
[0034]
FIG. 5 is a flow diagram of the steps preferably performed by the facility to divide a sentence into its constituent words. These steps generate a word list identifying the different words of the language that occur in the sentence, submit this word list to a parser, and identify the subset of words in the word list that the author intended to make up the sentence.
[0035]
In step 501, the facility adds the multi-character words that appear in the sentence to the word list. Step 501 is discussed in further detail below in connection with FIG. 6. In step 502, the facility adds the single-character words that appear in the sentence to the word list. Step 502 is described in more detail below in connection with FIG. 9. In step 503, the facility generates the vocabulary records, used by the parser, for the words added to the word list in steps 501 and 502. In step 504, the facility assigns a probability to each vocabulary record. The probability of a vocabulary record reflects the likelihood that the vocabulary record will be part of the correct parse tree of the sentence, and is used by the parser to order the application of the vocabulary records in the analysis process. The parser applies vocabulary records in order of decreasing probability during the analysis process. Step 504 is discussed in more detail below in connection with FIGS. 10 and 11. In step 505, the facility uses the syntactic parser to parse the vocabulary records and generate a parse tree that reflects the syntactic structure of the sentence. This parse tree has a subset of the vocabulary records generated in step 503 as its leaves. In step 506, the facility identifies the words represented by the vocabulary records that are the leaves of the parse tree as the words of the sentence. After step 506, these steps are finished.
[0036]
FIG. 6 is a flow diagram illustrating the steps that the facility preferably performs to add multi-character words to the word list. These steps use a current position within the sentence to analyze the sentence and identify multi-character words. These steps further make use of the CharPos, NextChars and IgnoreParts attributes that the facility added to the lexical knowledge base as shown in FIG. 3. According to a first preferred embodiment, the facility retrieves these attributes from the lexical knowledge base as needed during execution of the steps shown in FIG. 6. In a second preferred embodiment, the values of the NextChars and/or CharPos attributes of the characters in the sentence are all preloaded before the steps shown in FIG. 6 are performed. In connection with the second preferred embodiment, it is preferable to store in memory a three-dimensional array containing the value of the CharPos attribute for each character that appears in the sentence. The array indicates, for the character at a given position in the sentence, whether that character may occupy a given position in a word of a given length. Caching the values of these attributes allows them to be accessed quickly when the steps shown in FIG. 6 are executed.
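The three-dimensional array of the second embodiment might be laid out as in the following sketch, indexed by sentence position, position within the word, and word length. The nested-list layout, the function name, and the maximum word length of seven (borrowed from the longest word candidate considered in FIG. 6) are assumptions for illustration.

```python
def build_charpos_cache(sentence, char_pos, max_len=7):
    """Preload CharPos into cache[i][p][n].

    cache[i][p][n] is True when the character at 0-based sentence index i may
    occupy 1-based position p of a word that is n characters long.
    """
    cache = [
        [[False] * (max_len + 1) for _ in range(max_len + 1)]
        for _ in sentence
    ]
    for i, ch in enumerate(sentence):
        for position, length in char_pos.get(ch, ()):
            if length <= max_len:  # position <= length, so both indices fit
                cache[i][position][length] = True
    return cache

# Hypothetical CharPos values for a two-character toy sentence.
toy_char_pos = {"a": {(1, 2), (1, 3)}, "b": {(2, 2)}}
cache = build_charpos_cache("ab", toy_char_pos)
print(cache[0][1][2], cache[1][2][2], cache[1][1][2])  # True True False
```

With such a cache, the CharPos condition for a candidate becomes a sequence of constant-time array lookups rather than repeated retrievals from the lexical knowledge base.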
[0037]
In step 601, the facility sets the current position to the first character of the sentence. In steps 602 through 614, the facility repeats steps 603 through 613 until the position has advanced to the end of the sentence.
[0038]
In steps 603 through 609, the facility loops through each word candidate beginning at the current position. The facility starts with the word candidate that begins at the current position and is seven characters long and, in each iteration, removes one character from the end of the word candidate, until the word candidate is two characters long. If fewer than seven characters remain in the sentence beginning at the current position, the facility preferably omits the iterations for word candidates for which too few characters remain in the sentence. In step 604, the facility tests the current word candidate against conditions relating to the NextChars and CharPos attributes of the characters that make up the word candidate. Step 604 is discussed in more detail below in connection with FIG. 7. If both the NextChars and CharPos conditions are satisfied for the word candidate, the facility proceeds to step 605; otherwise, the facility proceeds to step 609. In step 605, the facility looks up the word candidate in the lexical knowledge base to determine whether the word candidate is a word. In step 606, if the word candidate is a word, the facility proceeds to step 607; otherwise, the facility proceeds to step 609. In step 607, the facility adds the word candidate to the list of words that appear in the sentence. In step 608, if the word candidate may contain other words, i.e., if the IgnoreParts attribute of the word is clear, the facility proceeds to step 609; otherwise, the facility proceeds to step 611. In step 609, if word candidates remain to be processed, the facility proceeds to step 603 to process the next word candidate. Otherwise, the facility proceeds to step 610. In step 610, the facility advances the current position by one character toward the end of the sentence. After step 610, the facility proceeds to step 614.
[0039]
In step 611, if the last character of the word candidate overlaps with another word candidate that may also be a word, the facility proceeds to step 613; otherwise, the facility proceeds to step 612. Step 611 is discussed in more detail below in connection with FIG. 8. In step 612, the facility advances the position to the character following the last character of the word candidate in the sentence. After step 612, the facility proceeds to step 614. In step 613, the facility advances the position to the last character of the current word candidate. After step 613, the facility proceeds to step 614. In step 614, if the position is not at the end of the sentence, the facility proceeds to step 602 to consider a new group of word candidates. Otherwise, these steps are finished.
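The scanning loop of steps 601 through 614 can be sketched as follows. The filter of step 604 and the overlap test of step 611 are passed in as callables so that the sketch stays independent of how they are implemented; the function signature, the set-based lexicon, and the toy stand-ins used in the example are all hypothetical.

```python
def find_multichar_words(sentence, lexicon, ignore_parts,
                         passes_filters, overlaps, max_len=7):
    """Collect (start, word) pairs for multi-character words in the sentence."""
    words = []
    pos = 0                                          # step 601
    while pos < len(sentence):                       # steps 602 and 614
        advance_to = pos + 1                         # step 610 is the default
        longest = min(max_len, len(sentence) - pos)
        for n in range(longest, 1, -1):              # steps 603-609, longest first
            candidate = sentence[pos:pos + n]
            if not passes_filters(candidate):        # step 604
                continue
            if candidate not in lexicon:             # steps 605-606
                continue
            words.append((pos, candidate))           # step 607
            if candidate in ignore_parts:            # step 608: IgnoreParts is set
                if overlaps(sentence, pos, n):       # step 611
                    advance_to = pos + n - 1         # step 613: last character
                else:
                    advance_to = pos + n             # step 612: following character
                break
        pos = advance_to
    return words

# Hypothetical toy data; each Latin letter stands in for one character.
lexicon = {"ab", "bc", "cd"}
ignore_parts = {"ab", "cd"}
always_pass = lambda candidate: True                 # stand-in for step 604
toy_overlap = lambda s, pos, n: s[pos + n - 1:pos + n + 1] in lexicon  # stand-in for step 611
print(find_multichar_words("abcd", lexicon, ignore_parts, always_pass, toy_overlap))
# [(0, 'ab'), (1, 'bc'), (2, 'cd')]
```

In the toy run, "ab" has IgnoreParts set but overlaps "bc", so the position advances only to "b"; "bc" has IgnoreParts clear, so the position advances by one character; "cd" has IgnoreParts set and no overlap, so the position jumps past it.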
[0040]
FIG. 7 is a flow diagram illustrating the steps that the facility preferably performs to test the NextChars and CharPos conditions for a word candidate. In step 701, if the second character of the word candidate is in the NextChars list of the first character of the word candidate, the facility proceeds to step 703. Otherwise, the facility proceeds to step 702 and returns the result that the conditions are not both satisfied. In steps 703 through 706, the facility loops through each character position in the word candidate. In step 704, if the ordered pair made up of the current position and the length of the word candidate is among the ordered pairs in the CharPos list for the character at the current character position, the facility proceeds to step 706; otherwise, the facility proceeds to step 705 and returns the result that the conditions are not both satisfied. In step 706, if character positions remain to be processed in the word candidate, the facility proceeds to step 703 to process the next character position in the word candidate. Otherwise, the facility proceeds to step 707 and returns the result that the word candidate satisfies both conditions.
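A minimal sketch of the FIG. 7 test follows, assuming NextChars and CharPos are held as dictionaries that map a character to a set of characters and to a set of (position, length) pairs respectively. The function name and the dictionary representation are illustrative assumptions.

```python
def passes_nextchars_and_charpos(candidate, next_chars, char_pos):
    """Return True only if the candidate satisfies both conditions."""
    first, second = candidate[0], candidate[1]
    # Steps 701-702: the second character must be a known follower of the first.
    if second not in next_chars.get(first, ()):
        return False
    length = len(candidate)
    # Steps 703-706: every character must be allowed at its position for this length.
    for position, ch in enumerate(candidate, start=1):
        if (position, length) not in char_pos.get(ch, ()):   # steps 704-705
            return False
    return True                                              # step 707

# Hypothetical attribute values; each Latin letter stands in for one character.
next_chars = {"a": {"b"}}
char_pos = {"a": {(1, 2)}, "b": {(2, 2)}, "c": {(3, 3)}}
print(passes_nextchars_and_charpos("ab", next_chars, char_pos))   # True
print(passes_nextchars_and_charpos("ac", next_chars, char_pos))   # False
print(passes_nextchars_and_charpos("abc", next_chars, char_pos))  # False
```

Both tests are cheap set lookups, which is what allows most character combinations to be discarded before the comparatively expensive lookup of step 605.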
[0041]
FIG. 8 is a flow diagram illustrating the steps that the facility preferably performs to determine whether the last character of the current word candidate overlaps with another word candidate that may be a word. In step 801, if the character following the word candidate is in the list of characters in the NextChars attribute of the last character of the word candidate, the facility proceeds to step 803. Otherwise, the facility proceeds to step 802 and returns the result that there is no overlap. In step 803, the facility looks up the word candidate without its last character in the vocabulary knowledge base to determine whether the word candidate without its last character is a word. In step 804, if the word candidate without its last character is a word, the facility proceeds to step 806 and returns the result that there is an overlap. Otherwise, the facility proceeds to step 805 and returns the result that there is no overlap.
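The overlap test of FIG. 8 can be sketched as below. The candidate is identified by its start index and length within the sentence; treating a candidate that ends the sentence as having no overlap is an assumption, since the text does not address that case, and the names and set-based lexicon are hypothetical.

```python
def last_char_overlaps(sentence, start, length, next_chars, lexicon):
    """Return True if the candidate's last character may begin another word."""
    end = start + length                  # index of the character after the candidate
    if end >= len(sentence):
        return False                      # assumption: nothing follows, so no overlap
    last = sentence[end - 1]
    # Steps 801-802: the following character must be a known follower of the last one.
    if sentence[end] not in next_chars.get(last, ()):
        return False
    # Steps 803-806: the candidate without its last character must itself be a word.
    return sentence[start:end - 1] in lexicon

# Hypothetical toy data; each Latin letter stands in for one character.
next_chars = {"a": {"b"}, "b": {"c"}}
print(last_char_overlaps("abc", 0, 2, next_chars, {"a", "ab", "bc"}))  # True
print(last_char_overlaps("abc", 0, 2, next_chars, {"ab", "bc"}))       # False
print(last_char_overlaps("abc", 1, 2, next_chars, {"a", "ab", "bc"}))  # False
```

In the first call, "ab" could also be read as "a" followed by a word beginning with "b", so the scan should resume at "b" rather than after it.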
[0042]
The execution of the steps shown in FIG. 6 for the above example is shown in Table 7 below.
[0043]
Table 7: Character combinations considered
[Table 7]

[0044]
Table 7 shows, for each of the 53 character combinations from the sample sentence that the facility considered, the result of the CharPos test, the result of the NextChars test, whether the facility looked up the combination in the vocabulary knowledge base (“look up?”), and whether the vocabulary knowledge base indicated that the combination of characters is a word (“is a word?”).
[0045]
It can be seen that combinations 1 through 4 fail the CharPos test. This is because the CharPos attribute of the character “昨” (“yesterday”) does not include the ordered pairs (1,7), (1,6), (1,5) or (1,4). Combinations 5 and 6, on the other hand, pass both the CharPos test and the NextChars test. The facility therefore looks up combinations 5 and 6 in the vocabulary knowledge base and determines that combination 5 is not a word but combination 6 is a word. After processing combination 6, in determining how far to advance the current position, the facility determines that the IgnoreParts attribute of the word is set, but that the word “昨天” (“yesterday”) overlaps with a word candidate beginning with the character “天” (“heaven”). The facility therefore advances to the character “天” at the end of combination 6 in accordance with step 613. Of combinations 7-12, only combination 12 passes both the CharPos and NextChars tests. Combination 12 is therefore looked up and determined to be a word. After processing combination 12, in determining how far to advance the current position, the facility determines that the IgnoreParts attribute of the word formed by combination 12 is clear, and therefore advances the current position by one character, to the character “下” (“below”), rather than to the character following combination 12.
[0046]
It can further be seen that the words formed by combinations 18, 24, 37 and 43 have their IgnoreParts attribute set, and that the last character of each does not overlap with any word candidate that may be a word. Thus, after processing each of them, the facility advances the current position to the character following the character combination in accordance with step 612, and thereby omits the unnecessary processing of up to 41 additional combinations for each of these four combinations.
[0047]
Further, it can be seen that the IgnoreParts attribute of the words formed by combinations 23 and 50 is clear. Thus, after processing these combinations, the facility advances the current position by one character in accordance with step 610.
[0048]
It can also be seen that the facility determined that the two-character combinations 30, 36, 47 and 52 do not constitute words. Thus, after processing these combinations, the facility advances the current position by one character in accordance with step 610. In all, the facility looked up only 14 of the 112 possible combinations in the sample sentence. Of the 14 combinations looked up by the facility, 9 are actual words.
[0049]
As shown in Table 8, after the processing described in connection with Table 7, the word list contains the words formed by combinations 6, 12, 18, 23, 24, 37, 43, 50, and 53, together with their parts of speech (noun, verb, and pronoun).
[0050]
Table 8: Word list of multi-character words
[Table 8]

[0051]
FIG. 9 is a flow diagram illustrating the steps that the facility preferably performs to add single-character words to the word list. In steps 901 through 906, the facility loops through each character in the sentence, from the first character to the last. In step 902, the facility determines, based on the character's entry in the vocabulary knowledge base, whether the character constitutes a single-character word. If it does, the facility proceeds to step 903; otherwise, the facility proceeds to step 906 without adding the character to the word list. In step 903, if the character is contained in a word that cannot contain other words, i.e., a word that is already on the word list and has its IgnoreParts attribute set, the facility proceeds to step 904; otherwise, the facility proceeds to step 905 to add the character to the word list. In step 904, if the character is contained within another word on the word list that overlaps another word on the word list, the facility proceeds to step 906 and does not add the character to the word list. Otherwise, the facility proceeds to step 905. In step 905, the facility adds the single-character word constituted by the current character to the word list. In step 906, if characters remain to be processed in the sentence, the facility proceeds to step 901 to process the next character in the sentence. Otherwise, these steps are finished.
[0052]
Table 9 below shows the single-character words 54-61 that the facility adds to the word list by performing the steps shown in FIG. 9. It also shows their parts of speech: noun, morpheme, noun (localizer), verb, preposition, adverb, function word, pronoun, and noun (classifier).
[0053]
Table 9: Word list of single-character and multi-character words
[Table 9]

[0054]
After adding the multi-character words and single-character words to the word list and generating vocabulary records for these words, the facility assigns probabilities to the vocabulary records. The parser uses these probabilities to order the application of the vocabulary records in the analysis process. FIGS. 10 and 11, discussed below, illustrate two alternative approaches that the facility uses to assign probabilities to vocabulary records.
[0055]
FIG. 10 is a flow diagram illustrating the steps that the facility preferably performs to assign probabilities to the vocabulary records generated from the words in the word list according to the first technique. The facility preferably ends up setting the probability of each vocabulary record either to a high probability value, which causes the parser to consider the vocabulary record early in the analysis process, or to a low probability value, which causes the parser to consider the vocabulary record later in the analysis process. In steps 1001 through 1005, the facility loops through each word in the word list. In step 1002, if the current word is contained in a larger word in the word list, the facility proceeds to step 1004; otherwise, the facility proceeds to step 1003. In step 1003, the facility sets the probability of the vocabulary record representing the word to the high probability value. After step 1003, the facility proceeds to step 1005. In step 1004, the facility sets the probability of the vocabulary record representing the word to the low probability value. After step 1004, the facility proceeds to step 1005. In step 1005, if words remain to be processed in the word list, the facility proceeds to step 1001 to process the next word in the word list. Otherwise, these steps are finished.
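The first technique can be sketched as follows, assuming each word-list entry is a (start, text) pair so that containment can be decided from character positions in the sentence. The numeric values 0.9 and 0.1 and all names are hypothetical; the text speaks only of a high and a low probability value.

```python
HIGH, LOW = 0.9, 0.1  # hypothetical values; the text only says "high" and "low"

def assign_probabilities(word_list):
    """Map each (start, text) entry to HIGH or LOW as in steps 1001-1005."""
    probabilities = {}
    for start, text in word_list:
        end = start + len(text)
        # Step 1002: is this word contained in a larger word on the list?
        contained = any(
            other_start <= start
            and end <= other_start + len(other_text)
            and len(other_text) > len(text)
            for other_start, other_text in word_list
        )
        probabilities[(start, text)] = LOW if contained else HIGH  # steps 1003-1004
    return probabilities

# Hypothetical word list for the toy sentence "abc".
word_list = [(0, "ab"), (1, "bc"), (0, "a"), (2, "c")]
for entry, p in assign_probabilities(word_list).items():
    print(entry, p)
# (0, 'ab') 0.9
# (1, 'bc') 0.9
# (0, 'a') 0.1
# (2, 'c') 0.1
```

Because a word is demoted only when a larger word covers it, every character of the sentence remains covered by at least one high-probability entry.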
[0056]
Table 10 below shows the probability values assigned to each word in the word list according to the steps shown in FIG. 10. By examining the probabilities, it can be seen that the facility assigns the high probability value to at least one word containing each character, so that at least one vocabulary record containing each character is considered early in the analysis process.
[0057]
Table 10: Word list with probability added
[Table 10]

[0058]
FIG. 11 is a flow diagram illustrating the steps that the facility preferably performs to assign probabilities to the vocabulary records generated from the words in the word list according to the second technique. In step 1101, the facility uses the word list to identify all possible segmentations of the sentence that are composed entirely of words in the word list. In step 1102, the facility selects, from the segmentations identified in step 1101, the one or more segmentations that contain the fewest words. If more than one possible segmentation has the fewest words, the facility selects each such segmentation.
[0059]
Table 11 below shows the possible segmentation with the fewest words (9) generated from the word list shown in Table 9.
[0060]
[Table 11]

[0061]
In step 1103, the facility sets the probability of the vocabulary records of the words in the selected segmentation(s) to the high probability value. In step 1104, the facility sets the probability of the vocabulary records of the words not in the selected segmentation(s) to the low probability value. After step 1104, these steps are finished.
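Steps 1101 through 1104 can be sketched as an exhaustive enumeration followed by a fewest-words selection. Word-list entries are assumed to be (start, text) pairs, the numeric probability values are hypothetical, and assigning the low value to every word when no complete segmentation exists is an assumption, since the text does not cover that case.

```python
def all_segmentations(sentence, word_list):
    """Step 1101: every way to cover the sentence using only listed words."""
    by_start = {}
    for start, text in word_list:
        by_start.setdefault(start, []).append(text)

    def extend(pos):
        if pos == len(sentence):
            yield []
            return
        for text in by_start.get(pos, []):
            for rest in extend(pos + len(text)):
                yield [(pos, text)] + rest

    return list(extend(0))

def assign_by_fewest_words(sentence, word_list, high=0.9, low=0.1):
    """Steps 1102-1104: favor words in the segmentation(s) with the fewest words."""
    segmentations = all_segmentations(sentence, word_list)
    if not segmentations:
        return {entry: low for entry in word_list}  # assumption: nothing to favor
    fewest = min(len(s) for s in segmentations)                       # step 1102
    selected = {entry for s in segmentations if len(s) == fewest for entry in s}
    return {entry: (high if entry in selected else low) for entry in word_list}

# Hypothetical word list for the toy sentence "abc".
word_list = [(0, "ab"), (1, "bc"), (0, "a"), (1, "b"), (2, "c")]
print(all_segmentations("abc", word_list))
# [[(0, 'ab'), (2, 'c')], [(0, 'a'), (1, 'bc')], [(0, 'a'), (1, 'b'), (2, 'c')]]
print(assign_by_fewest_words("abc", word_list)[(1, "b")])  # 0.1
```

In the toy example two segmentations tie at two words each, so the words of both receive the high value and only "b", which occurs solely in the three-word segmentation, receives the low value. The recursive enumeration is exponential in the worst case, so a practical implementation would likely use dynamic programming instead.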
[0062]
Table 12 below shows the probability values assigned to each word in the word list according to the steps shown in FIG. 11. By examining the probabilities, it can be seen that the facility assigns the high probability value to at least one word containing each character, so that at least one vocabulary record containing each character is considered early in the analysis process.
[0063]
Table 12: Word list with probability added
[Table 12]

[0064]
FIG. 12 is a parse tree diagram showing the parse tree, generated by the parser, that represents the syntactic structure of the sample sentence. It can be seen that the parse tree is a hierarchical structure having a single sentence record 1231 as its head and a number of vocabulary records 1201 to 1211 as its leaves. The parse tree also has intermediate syntactic records 1221-1227 that combine the vocabulary records, each representing a word, into larger syntactic structures representing one or more words. For example, the prepositional phrase record 1223 combines a vocabulary record 1204 representing a preposition and a vocabulary record 1206 representing a noun. In accordance with step 506 of FIG. 5, the facility identifies the words represented by the vocabulary records 1201-1211 in the parse tree as the words into which the sample sentence should be segmented. The facility may also retain this parse tree in order to perform further natural language processing on the sentence.
[0065]
While the invention has been shown and described with reference to preferred embodiments, those skilled in the art will recognize that various changes and modifications in form and detail can be made without departing from the scope of the invention. For example, the features of the facility described above can be applied to perform word segmentation in languages other than Chinese. In addition, a suitable subset or superset of the techniques described herein can also be applied to perform word segmentation.
[Brief description of the drawings]
FIG. 1 is a high-level block diagram of a general-purpose computer system on which the facility preferably executes.
FIG. 2 is a schematic flow diagram showing two phases in which the facility preferably operates.
FIG. 3 is a flow diagram illustrating the steps that the facility preferably performs in the initialization phase to augment the vocabulary knowledge base to include information used in performing word segmentation.
FIG. 4 is a flow diagram illustrating the steps that are preferably performed to determine whether a particular word may include other smaller words.
FIG. 5 is a flow diagram of the steps preferably performed by the facility to divide a sentence into its constituent words.
FIG. 6 is a flow diagram illustrating the steps that the facility preferably performs to add a multi-character word to the word list.
FIG. 7 is a flow diagram illustrating the steps that the facility preferably performs to test the NextChars and CharPos conditions for a word candidate.
FIG. 8 is a flow diagram illustrating the steps that the facility preferably performs to determine whether the last character of the current word candidate overlaps with another word candidate that may be a word.
FIG. 9 is a flow diagram illustrating the steps that the facility preferably performs to add a single character word to the word list.
FIG. 10 is a flow diagram illustrating steps that the facility preferably performs to assign probabilities to vocabulary records generated from words in the word list according to the first technique.
FIG. 11 is a flow diagram illustrating steps that the facility preferably performs to assign probabilities to vocabulary records generated from words in the word list according to the second technique.
FIG. 12 is a parse tree diagram showing the parse tree, generated by the parser, that represents the syntactic structure of the sample sentence.

Claims

A computer-readable medium storing a program for causing a computer system to function as
word segmentation software facility means for performing word segmentation of sentences in a non-segmented language, and as
vocabulary knowledge base means for storing information for identifying words that appear in natural language text, wherein the program, when executed by the computer system, implements a method of identifying, for each of a plurality of consecutive combinations of characters appearing in a natural language character string, whether the combination of characters may be a word, using a NextChars attribute indicating the characters that appear in the second position of words beginning with a character and a CharPos attribute indicating the positions at which a character appears in words, the method comprising:
a determination procedure (701) in which the word segmentation software facility means determines whether the NextChars attribute stored in the vocabulary knowledge base means indicates that the character appearing in the second position of the combination appears in a word beginning with the character appearing in the first position of the combination;
a determination procedure (704) in which, when the word segmentation software facility means has determined that the NextChars attribute indicates that the character appearing in the second position of the combination appears in a word beginning with the character appearing in the first position of the combination, the word segmentation software facility means determines whether the CharPos attribute stored in the vocabulary knowledge base means indicates that each character of the combination appears in a word at the position at which it appears in the combination; and
a determination procedure (605) in which, when the word segmentation software facility means has determined that the CharPos attribute indicates that each character of the combination appears in a word at the position at which it appears in the combination, the word segmentation software facility means determines whether the combination of characters is a word by using the information for identifying words stored in the vocabulary knowledge base means.

A computer memory recording a program for causing a computer system to function as
word segmentation software facility means for performing word segmentation of sentences in a non-segmented language and as vocabulary knowledge base means for storing information for identifying words that appear in natural language text, the computer memory having recorded, in the vocabulary knowledge base means, data having a word segmentation data structure used to identify the individual words appearing in natural language text,
wherein the word segmentation data structure comprises:
for each of a plurality of characters,
a NextChars attribute used by the word segmentation software facility means to identify the characters that appear in the second position of words beginning with the character;
for words containing the character,
a CharPos attribute used by the word segmentation software facility means to identify the length of the word and the character position within the word occupied by the character; and
for each of a plurality of words,
an IgnoreParts attribute used by the word segmentation software facility means to indicate whether the character string making up the word may constitute a series of shorter words.