JP3917880B2 - Speech recognition apparatus, speech recognition method, and speech recognition program

Info

Publication number: JP3917880B2
Application number: JP2002069388A
Authority: JP
Inventors: 和正本田; 彰鶴田
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2002-03-14
Filing date: 2002-03-14
Publication date: 2007-05-23
Anticipated expiration: 2022-03-14
Also published as: JP2003271187A

Description

【０００１】
【発明の属する技術分野】
本発明は、発声された音声を認識する音声認識装置に関するものである。
【０００２】
【従来の技術】
音声認識技術はここ数年で広く利用されるようになってきている。近年の計算機技術の発展に伴い、単語音声認識だけでなく連続音声認識もＰＣ上のソフトウェアなどで利用可能になっている。
【０００３】
音声認識においては、認識辞書に記憶されていない単語（未知語）をユーザが発声する可能性がある。連続音声認識における未知語への対応手法としては、“連続音声認識における未知語の扱い”（信学技報、ＳＰ９１−９６、Ｄｅｃ。１９９１）で述べられているように音韻タイプライタを利用するものがある。音韻タイプライタは、日本語としてありうる発声すべてを認識できるように声の特徴をサブワードでモデル化したものであり、そのサブワードには音素がよく用いられる。
【０００４】
以下、従来の、音韻タイプライタを用いた未知語処理について説明する。図３は従来の、音韻タイプライタを用いた未知語処理の一例を表したブロック図である。
【０００５】
音声認識装置３００に入力された話者の音声は、入力部３０１に入力され、ディジタル信号化される。ディジタル信号化された音声波形は、音響分析部３０２に入力され、分析される。分析方法としては、２０〜４０ｍｓｅｃの区間ごとに、比較的短時間の時間窓をかけて、８〜１６ｍｓｅｃごとに窓をシフトしていく短時間スペクトル分析の手法が使われることが多い。時間窓によって切り出された音声波形は、切り出された時間長を持つフレームと呼ばれる単位ごとの特徴ベクトルの時系列に変換される。特徴ベクトルは、その時刻における音声スペクトルの特徴量を抽出したもので、通常１０〜５０次元であり、メル周波数ケプストラム係数などが広く用いられている。変換した特徴ベクトルを認識部３０３へ出力する。
【０００６】
音響モデル３０７には、認識単位ごとに用意されたＨＭＭ（隠れマルコフモデル）が広く用いられており、認識単位としては音素片が用いられることが多い。ＨＭＭとは複数個の状態を持つ非決定性確率有限オートマトンであり、非定常信号源を定常信号源の連結で表す統計的信号源モデルとして用いられている。ＨＭＭは、遷移する状態の集まりとして表され、状態の遷移の確率を表す遷移確率と、状態が遷移するときに観測ベクトルの確率を出力する出力確率とからなる。音声認識に用いられる、ＨＭＭで表現された音響モデルは、音韻の性質をモデル化している。このＨＭＭでは、状態は、おおよそ音韻のイベント（閉鎖、破裂、摩擦、定常母音など、安定な区間）に対応する。出力確率は、遷移に伴って出力される信号の揺らぎの確率である。認識辞書の語彙に含まれる各単語について、認識単位（音素片等）それぞれに対応する状態の出力によって構成される系列（以下ＨＭＭの出力系列と表記する）が、入力信号の系列（それぞれ特徴ベクトルの時系列であることが多い）と一致する確率を出力確率と遷移確率から求め、その値が最大となる単語を認識結果とすることで、音声認識が実現される。この場合、入力信号の系列は、すでに起きた事象の観測データとして得られており、そのデータと比較して、ＨＭＭの出力系列がどれだけもっともらしいか、ということを求めている。すでに起きている事象（入力信号の系列）を説明する仮説（ＨＭＭの出力系列）の正しさを求めるために、確率ではなく尤度という概念を用いて分析が行われる。ＨＭＭの状態系列の尤度を計算するときは、尤度の積の代わりに対数尤度の和を求めることが多い。こういった状態系列を求めるアルゴリズムとしてＶｉｔｅｒｂｉアルゴリズムが広く用いられている。連続音声認識の場合は、言語モデルによって文章中の各単語のならびに文法的な制限が設けられている。
【０００７】
出力確率、遷移確率などのパラメータは、対応する学習音声を与えてバウム−ウェルチアルゴリズムと呼ばれるアルゴリズムなどであらかじめ学習されている。以下は、認識単位が音素であるＨＭＭが音響モデル３０７に記憶されているとする。
【０００８】
認識辞書３０４には、認識可能な語彙の情報が記憶されている。単語の表記、音素記号列が記憶されている。
【０００９】
言語モデル３０５には、認識辞書３０４に含まれる語彙に基づいたｎ−ｇｒａｍモデルが広く用いられている。ｎ−ｇｒａｍモデルとは、サンプルデータから統計的な手法によって確率推定を行う統計的言語モデルの一種であり、ｎ−ｇｒａｍモデルを用いた言語モデルの実装について、「音声認識システム」（オーム社出版局）に詳しく記載されている。
【００１０】
音韻連鎖モデル３０６には、日本語としてありうる音韻連鎖の規則が記憶されている。音韻タイプライタを利用した未知語処理においては、通常の連続音声認識とともに、未知語検出用の音韻認識が並列に行われる。音韻連鎖モデル３０６の実装としては、認識部３０３の実装を共通化するために、上記言語モデル３０５と同じデータ構造で記憶される事が多い。
【００１１】
認識部３０３では、音響分析部３０２の出力、音響モデル３０７、言語モデル３０５、音韻連鎖モデル３０６のそれぞれの情報を用いて音声認識処理を行う。通常の連続認識処理については、上記「音声認識システム」（オーム社出版局）にその実現方法が書かれている。ただし、ここでは認識処理において、探索は１パスフレーム同期ビームサーチで行われるとする。このとき、各フレームで音韻タイプライタを用いた音韻認識も並列に行われる。図４は認識部３０３における各フレームでの処理をフローチャートで表したものである。以下、図４にしたがってｉ番目のフレームでの認識処理について説明する。
【００１２】
まず、ステップ４０１で、対応するフレームにおいて、通常の認識処理をおこなう。具体的には、音響分析部３０２の出力した特徴ベクトルから、音響モデル３０７と言語モデル３０５を用いて各仮説の尤度を計算し、尤度の低い仮説を評価の対象からはずす（枝刈り）処理を行う。尤度の計算方法、枝刈りについても上記「音声認識システム」（オーム社出版局）に記載されている。さらに、各仮説の累積尤度を記憶しておく。例えば、フレームｉに対応する累積尤度を配列Ｐ１［ｉ］として、Ｐ１を仮説ごと別々に記憶する。
【００１３】
次に、ステップ４０２で、言語モデル３０５の代わりに音韻連鎖モデル３０６を用いてステップ４０１と同様の処理を行う。ただし、ステップ４０１での各仮説中の最大尤度を枝刈りするスコアの閾値とする。なぜなら、言語モデル３０５を用いた認識の場合は言語的制約があるため、日本語としてありうる音韻連鎖の可能性がある音韻連鎖モデル３０６を用いた認識における尤度のほうが最大値は高くなり、さらに未知語検出のためには音韻連鎖モデルを用いた認識の仮説はひとつ残っていればよいためである。このとき、対応するフレームにおける音韻連鎖モデルを用いた場合の最大の累積尤度を配列Ｐ２［ｉ］に記憶しておく。
【００１４】
すべてのフレームについて上記の処理が終了したら、もっともスコアの高い仮説に対応する文の各単語について、それぞれの単語における最後のフレーム（ここではｊ番目とする）の、音韻連鎖モデルを用いた分析での最大累積尤度に上記音韻連鎖モデル重み記憶部に記憶された値と言語モデル３０５を用いた分析での累積尤度の差を求め、それをＳとして、各単語で記憶する。つまり、上記音韻連鎖モデル重み記憶部に記憶された値をαとすると、α×Ｐ２［ｊ］―Ｐ１［ｊ］＝Ｓである。この値を各単語について求める。最後に、最もスコアの高い仮説に対応する文の各単語の表記と、Ｓの値を未知語区間検出部３０９に出力する。
【００１５】
未知語区間検出部３０９は、認識部３０３の出力を入力とする。入力された各単語におけるＳが０より大きい単語を未知語として、その単語の表記を「未知語」と変換した文字列を出力部３１０に出力する。ここで、音韻連鎖モデル重み記憶部３０８にはあらかじめ騒音などの発話環境を考慮して適当に決定された音韻連鎖モデル重みが記憶されている。
【００１６】
出力部３１０は、ディスプレイなど、文字列を出力できる装置であり、未知語区間検出部から出力された文字列を出力する。
【００１７】
上記の音韻連鎖モデル重みは、話者や発声環境によって大きく変化するため、精度よく未知語の検出を行うためには、それらの環境によって音韻連鎖モデル重みを変化させる必要がある。環境の違いによらない未知語の検出としては、特開平４−２５５９００号報に見られる手法を用いることで、音韻タイプライタによる分析と言語制約をもった分析での尤度から適切な音韻連鎖モデル重みを求めるというものがある。
【００１８】
【発明が解決しようとする課題】
しかしながら、上記従来の技術で述べた方法では、適切な音韻連鎖モデル重みが環境によって異なるという問題がある。それを解決するための特開平４−２５５９００号公報で見られる手法においては、発声最後まで分析しなければ適切な音韻連鎖モデル重みがわからないという欠点があり、フレーム同期探索時などに未知語であるかどうかの情報がまったくわからない。ｎ−ｇｒａｍなどの統計的言語モデルを用いた音声認識では、未知語が発声された後の認識結果はあまり信用できず、音韻タイプライタとの差が大きい部分だけを未知語と認識しても、その後の部分に正しい単語が結果として選ばれていない場合が多いので、未知語以外の部分の認識に問題がある。
【００１９】
そこで本発明の目的は、精度よく未知語の検出を行い、未知語以外の部分の認識にも頑健な未知語処理を実現できる音声認識装置を提供することにある。
【００２０】
【課題を解決するための手段】
本発明は、単語を記憶した認識辞書と、上記認識辞書に記憶された単語をあらかじめ学習した言語モデルと、認識対象言語においてありうる音韻連鎖の規則を記憶した音韻連鎖モデルとを有する音声認識装置において、上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析を行う認識部と、上記音韻連鎖モデルを用いた分析結果に重みをかける値を記憶する音韻連鎖モデル重み記憶部と、上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析の結果を利用して上記音韻連鎖モデル重み記憶部に記憶された値を変更する音韻連鎖モデル重み変更部と、を備えたことを特徴とする。
【００２１】
また、本発明は、上記言語モデルの重みを記憶する言語モデル重み記憶部と、上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析との結果から上記言語モデル重み記憶部に記憶された重みを変更する言語モデル重み変更部と、を備えたことを特徴とする。
【００２２】
また、本発明は、上記言語モデル重み変更部は、上記言語モデルの重みを上記認識部における上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析で得られる尤度の差の値あるいは上記言語モデル重み記憶部に記憶されている値の関数によって得られる値に変更することを特徴とする。
【００２３】
また、本発明は、上記音韻連鎖モデル重み記憶部は、上記音韻連鎖モデル重みを上記認識部における上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析で得られる尤度の差の値あるいは上記音韻連鎖モデル重み記憶部に記憶された値の関数によって得られる値に変更することを特徴とする。
【００２４】
また、本発明は、上記認識部は、尤度計算時の探索において複数パスの探索を行い、二つ目以降のパスでは、上記言語モデル重み変更部は上記言語モデルの重み記憶部に記憶されている値を維持することを特徴とする。
【００２５】
また、本発明は、上記認識部は、尤度計算時の探索において複数パスの探索を行い、二つ目以降のパスでは、上記音韻連鎖モデル重み変更部は上記音韻連鎖モデル重み記憶部に記憶されている値を維持することを特徴とする。
【００２６】
また、本発明は、上記言語モデル重み記憶部は、発声前の値を別に記憶し、上記言語モデル重み変更部は、発声が終了したと判断されたときに、上記言語モデル重み記憶部に記憶された値を発声前の値に変更することを特徴とする。
【００２７】
また、本発明は、上記音韻連鎖モデル重み記憶部は、発声前の値を別に記憶し、上記音韻連鎖モデル重み変更部は、発声が終了したと判断されたときに、上記音韻連鎖モデル重み記憶部に記憶された値を発声前の値に変更することを特徴とする。
【００２８】
また、本発明は、上記言語モデル重み変更部は、無音部分と判断される処理単位においては、上記言語モデル重み記憶部に記憶された値を維持することを特徴とする。
【００２９】
また、本発明は、上記音韻連鎖モデル重み変更部は、無音部分と判断される処理単位においては、上記音韻連鎖モデル重み記憶部に記憶された値を維持することを特徴とする。
【００３０】
また、本発明は、単語を記憶した認識辞書と、上記認識辞書に記憶された各単語をあらかじめ学習した言語モデルと、認識対象言語においてありうる音韻連鎖の規則を記憶した音韻連鎖モデルとを用いる音声認識方法において、時間で分割された処理単位ごとに、上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析を行い、それぞれの分析における最大尤度の値を出力する認識手段と、上記音韻連鎖モデルを用いた分析に重みをかける値を記憶する音韻連鎖モデル重み記憶手段と、上記処理単位ごとに、上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析のそれぞれの最大尤度の値を用いて上記音韻連鎖モデル重み記憶手段で記憶した値を変更する音韻連鎖モデル重み変更手段と、を備えたことを特徴とする。
【００３１】
また、本発明は、上記言語モデルの重みを記憶する言語モデル重み記憶手段と、上記処理単位ごとに、上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析のそれぞれの最大尤度の値を用いて上記言語モデル重み記憶手段で記憶した重みを変更する言語モデル重み変更手段と、を備えたことを特徴とする。
【００３２】
また、本発明は、単語を記憶した認識辞書と、上記認識辞書に記憶された各単語をあらかじめ学習した言語モデルと、認識対象言語においてありうる音韻連鎖の規則を記憶した音韻連鎖モデルとを有する音声認識方法において、上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析を行う認識手段と、上記音韻連鎖モデルを用いた分析結果に重みをかける値を記憶する音韻連鎖モデル重み記憶手段と、上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析の結果を利用して上記音韻連鎖モデル重み記憶手段によって記憶された値を変更する音韻連鎖モデル重み変更手段と、を備えたことを特徴とする。
【００３３】
また、本発明は、上記言語モデルの重みを記憶する言語モデル重み記憶手段と、上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析の結果から上記言語モデル重み記憶手段によって記憶された重みを変更する言語モデル重み変更手段と、を備えたことを特徴とする。
【００３４】
また、本発明は、単語の情報を記憶した認識辞書と、上記認識辞書に記憶された各単語をあらかじめ学習した言語モデルと、認識対象言語においてありうる音韻連鎖の規則を記憶した音韻連鎖モデルとを用いる音声認識プログラムであって、コンピュータを、上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析を行う認識手段と、上記音韻連鎖モデルを用いた分析結果に重みをかける値を記憶する音韻連鎖モデル重み記憶手段と、上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析の結果を利用して上記音韻連鎖モデル重み記憶手段によって記憶された値を変更する音韻連鎖モデル重み変更手段として機能させるための音声認識プログラムを提供する。
【００３５】
また、本発明は、上記音声認識プログラムにおいて、コンピュータを、上記言語モデルの重みを記憶する言語モデル重み記憶手段と、上記言語モデルを用いた分析と上記音韻連鎖モデルを用いた分析の結果から上記言語モデル重み記憶手段によって記憶された重みを変更する言語モデル重み変更手段として機能させるための音声認識プログラムを提供する。
【００３６】
【発明の実施の形態】
以下、本発明を実施例に基づき詳細に説明する。
【００３７】
図１は本発明による音声認識装置の一例を表すブロック図である。この音声認識装置１００は、入力部１０１、音響分析部１０２、認識部１０３、音響モデル１０７、認識辞書１０４、言語モデル１０５、音韻連鎖モデル１０６、音韻連鎖モデル重み変更部１１０、音韻連鎖モデル重み記憶部１１１、言語モデル重み変更部１０８、言語モデル重み記憶部１０９、未知語区間検出部１１２、出力部１１３で構成される。
【００３８】
入力部１０１は、入力された音声をディジタル信号化する。
【００３９】
音響分析部１０２は、入力されたディジタル信号を特徴ベクトルの時系列に変換する。
【００４０】
音響モデル１０７は、音素片ごとに用意されたＨＭＭを用いて発声の音韻的特徴が記憶されている。認識部１０３での２パス探索に対応するために、精度が低いがより高速な認識を行うことのできるモデルと、精度がより高いが低速で認識するモデルとの２つのモデルをそれぞれ記憶している。
【００４１】
認識辞書１０４には、認識可能な語彙の情報として、単語の表記を表す文字列と音素記号列を表す文字列が記憶されている。
【００４２】
言語モデル１０５には、ｎ−ｇｒａｍモデルによる統計的な言語情報が記憶されている。
【００４３】
音韻連鎖モデル１０６には、日本語としてありうる音韻連鎖の規則が、言語モデルと同様のデータ構造で記憶されている。
【００４４】
音韻連鎖モデル重み記憶部１１１には、認識部における認識で使用する音韻連鎖モデル重みを記憶しており、初期値としては１が記憶されている。音韻連鎖モデル重みの使い方については後述する。
【００４５】
言語モデル重み記憶部１０９には、認識部における認識処理で使用する言語モデルの重みを記憶しており、初期値としては１が記憶されている。
【００４６】
認識部１０３では、音響分析部１０２の出力、音響モデル１０７、言語モデル１０５、音韻連鎖モデル１０６のそれぞれの情報を用いて音声認識処理を行う。ここでは、認識部１０３は２パスで認識処理を行うとし、どちらのパスでもフレーム同期ビームサーチを行うとするが、１パス目では単純な音響モデルで高速化を図りつつ候補を絞り、２パス目で高精度な音響モデルを用いて精度の高い認識を行う。図２は認識部における１パス目でのｉ番目のフレームの処理を表したフローチャートである。以下、図２にしたがってｉ番目のフレームでの認識処理について説明する。
【００４７】
ステップ２０１において、対応するフレームにおいて、音響モデル１０７と言語モデル１０５を用いて従来の技術で述べた方法と同じように通常の認識処理を行う。ここでも、フレームｉに対応する累積尤度を配列Ｐ１［ｉ］として、尤度Ｐ１を仮説ごとに記憶する。
【００４８】
次に、ステップ２０２で、言語モデル１０５の代わりに音韻連鎖モデル１０６を用いてステップ２０１と同様の処理を行う。このときステップ２０１の各仮説の最大尤度を枝刈りするスコアの閾値とする。同様に、対応するフレームにおける音韻連鎖モデルを用いた場合の最大の累積尤度を配列Ｐ２［ｉ］として記憶しておく。
【００４９】
ステップ２０３において、無音部分かどうかの判定を行う。音響モデル１０７において、無音に対応するＨＭＭの状態が最も尤度が高いときに無音であると判断する。無音だった場合はそのまま次へ進み、無音でなかった場合はステップ２０４に進む。無音部分においては、言語モデルを用いた認識において尤度の差は出ないため、言語モデル重みの値や音韻連鎖モデル重みを変更しても期待した効果は得られないためである。音韻連鎖モデル重み変更部に変更の通知と、配列Ｐ２［ｉ］の値と配列Ｐ１［ｉ］の値を出力する。
【００５０】
ステップ２０５において、言語モデル重み変更部に値を変更する通知を送るかどうかの判定を行う。ここで、音韻連鎖モデル重み記憶部に記憶されている値をαとすると、α×Ｐ２［ｉ］―Ｐ１［ｉ］が正の値であるときは、ステップ２０６において、配列Ｐ２［ｉ］の値と配列Ｐ１［ｉ］の値を、変更の通知とともに言語モデル重み変更部および音韻連鎖モデル重み記憶部に出力する。そうでなければ次のフレームに移る。
【００５１】
２パス目では高精度な音響モデルを用いる以外はほぼ同様の処理を行うのであるが、２パス目では言語モデル重み変更部、音韻連鎖モデル重み変更部への変更通知を行わず、言語モデル重み変更部、音韻連鎖モデル重み変更部は前回の値を維持する。１パス目ですべての入力フレームでの音韻連鎖モデルを用いた分析が終わっているので、その発声すべての情報を用いることができるからである。
【００５２】
また、今回の発声によって変更された言語モデル重みの値および音韻連鎖モデル重みは、必ずしも発声前の値に戻さなくてよい。携帯電話など、発声時の騒音などの環境が特定されない機器へ応用する場合は、発声前の値に戻すことで次回の発生で環境が大きく変わっている場合にも、今回の値を記憶したことによる悪影響を防ぐことができる。逆に、家庭用コンピュータなど、発声時の環境がほぼ一定である機器へ応用する場合は、発声前の状態に戻さず今回の発声によって変更された値を記憶しつづけることで、次回の発声においても適切な値で認識処理を行うことが可能となる。
【００５３】
すべてのフレームにおいて上記の処理を終え、２パス目の処理も終えたら、未知語区間検出部１１２に、尤度が最大の仮説と仮説の各単語の累積尤度の差Ｓを出力する。
【００５４】
音韻連鎖モデル重み変更部１１０では、認識部１０３からの通知を受け取り、音韻連鎖モデル重み記憶部に記憶されている値を変更する。ここで、音韻連鎖モデルを用いた分析においては、認識辞書や言語モデルによる言語的な制約を受けないため、得られる尤度が大きくなる。音韻連鎖モデルを用いた分析と言語モデルを用いた分析の尤度を比較する際に、発声が未知語でない部分でも、音韻連鎖モデルを用いた分析で得られた尤度が大きくなってしまうことが多いため、重みを示す係数（ペナルティ）は必ず１より小さい値に設定される。
【００５５】
変更は、音韻連鎖モデル重み記憶部に記憶されている値をαとすると、α＝（Ｐ２［ｉ］―Ｐ１［ｉ］）／ｉとする。これは、環境の違いによる配列Ｐ２［ｉ］と配列Ｐ１［ｉ］の尤度の差は、音響モデル１０７と発声された音声の特徴の違いもあって大きくなることが多いからである。
【００５６】
言語モデル重み変更部１０８は、認識部１０３からの通知にしたがって言語モデル重み記憶部に記憶されている値を変更する。変更は、言語モデル重み記憶部に記憶されている値をβとすると、β＝β×（（Ｐ２［ｉ］―Ｐ１［ｉ］）／Ｐ２［ｉ］）とする。これは、配列Ｐ２［ｉ］と配列Ｐ１［ｉ］の値の差が大きいほどそのあとの言語モデルの数値が信用できなくなるためである。
【００５７】
未知語区間検出部１１２では、認識部１０３の出力を入力とし、入力された各単語における累積尤度の差Ｓが０より大きい単語を未知語として、その単語の表記を「未知語」と変換した文字列を出力部１１３に出力する。
【００５８】
出力部１１３は、ディスプレイなどの文字列を出力できる装置が使用され、認識部１０３から出力された文字列を出力する。
（実施例１）
以下、「自由が丘に行く」という文章が発声された場合の具体的な処理動作を示す。このとき、「自由が丘」という単語は認識辞書１０４に含まれていないものとする。
【００５９】
まず、入力された音声は入力部１０１によってディジタル信号化される。ディジタル信号化された音声は短時間スペクトル分析の手法でフレーム単位に分割され、各フレームでの音声は音響分析部１０２によって特徴ベクトルの時系列に変換され、認識部１０３に出力される。
【００６０】
認識部１０３では、各フレームにおいて図２のフローチャートに従って処理が行われる。
【００６１】
まずステップ２０１で、言語モデル１０５と音響モデル１０７を用いて尤度Ｐ１の計算を行う。
【００６２】
つぎにステップ２０２で、音韻連鎖モデル１０６と音響モデル１０７を用いて尤度Ｐ２の計算を行う。
ここでは、「自由が丘」という単語は認識辞書１０４に含まれていないため、尤度Ｐ１が最大となるのは認識辞書１０４に含まれる別の単語、例えば「自営業」などとなる。「自由が丘」と「自営業」では音素の並びが異なるために、「自営業」の尤度Ｐ１の値は認識辞書１０４に含まれる単語の中では高いものの、認識語彙による音素の並びの制約をうけない音韻連鎖モデル１０６を用いて計算した尤度Ｐ２よりは低くなると考えられる。
【００６３】
次に、ステップ２０３において無音部分かどうかの判定を行い、無音だった場合には、そのまま次のフレームの処理へ進み、無音でなかった場合はステップ２０４へ進む。
【００６４】
ステップ２０４においては、音韻連鎖モデル重み変更部に重みの変更通知と、尤度Ｐ１および尤度Ｐ２の値を出力する。適切な音韻連鎖モデル重みは、騒音や残響などの周囲の環境によって大きく変わるため、それまでの分析で得られた値をもとに適切な音韻連鎖モデル重みを設定するためである。
【００６５】
次に、ステップ２０５で、言語モデルの重みを変更するかどうかの判定を行う。α×Ｐ２［ｉ］―Ｐ１［ｉ］が正の値であればステップ２０５で言語モデルの重みを変更し、そうでなければなにもせずに次のフレームの処理に移る。２パス目でも高精度な音響モデルを用いる以外はほぼ同様の処理を行うのであるが、２パス目では言語モデル重み変更部、音韻連鎖モデル重み変更部への変更通知を行わず、言語モデル重み変更部、音韻連鎖モデル重み変更部は前回の値を維持する。また、すべてのフレームでの処理が終了したら、言語モデル重みの値と音韻連鎖モデル重みを発声前の値に戻す。
【００６６】
言語モデル１０５にはｎ−ｇｒａｍモデルによる統計的な言語情報が記憶されているため、「自営業に行く」という日本語としてありえない単語の並びは、音響モデル１０７を用いて得られる尤度は高いものの、言語モデル１０５を用いて得られる尤度が低くなってしまい、例えば「自営業である」、「自営業を営む」などといった単語の並びの尤度のほうが高くなり、発声における「に行く」の部分で認識誤りが生じてしまう。しかしながら、ステップ２０５において言語モデルの重みを変更することで言語的な制約を緩くして、「自営業」のあとに、音響尤度の高い「に行く」といった単語の並びが続くことを可能にすることができる。すべてのフレームにおいて上記の処理を終え、２パス目の処理も終えたら、未知語区間検出部１１２に、尤度が最大の仮説（この例では「自営業に行く」）と仮説の各単語の累積尤度の差Ｓをそれぞれ出力する。言語モデルの重みを変更しなかった場合は、尤度が最大の仮説は「自営業である」などといった、日本語として正しい単語列になることが多い。
【００６７】
未知語区間検出部１１２では、認識部１０３の出力である、「自営業に行く」という仮説と、「自営業」、「に」、「行く」の各単語についての尤度の差Ｓのそれぞれを入力とする。ここでは、各単語での累積尤度の差Ｓが０より大きい単語を未知語と判定する。「自由が丘」と「自営業」の音素の並びの違いから、「自営業」では尤度Ｐ１と尤度Ｐ２の差が大きいので、「自営業」が未知語と判定できる。音韻連鎖モデル重みを変更しなかった場合は、発声環境により適切な音韻連鎖モデル重みの設定が困難なため、「自営業」を未知語と判定できなかったり、「に」や「行く」を未知語と判定してしまうことがある。未知語区間検出部１１２では、「自営業」の部分を「未知語」と変換して出力部１１３に出力し、出力部１１３はその結果をディスプレイなどの表示装置に出力する。
（実施例２）
以下さらに、本発明の音声認識方法を用いた音声認識装置を単語音声認識装置として使用した場合の例として、「自由が丘」という単語が発声された場合の動作を示す。図５は本発明の音声認識方法を用いて、単語認識を行う音声認識装置の例である。単語のみを対象とした認識処理の場合は、ｎ−ｇｒａｍによる統計的な言語情報は用いない点が異なっている。また、単語認識では先ほどの文章の認識でよく用いられている複数パスの認識は行わないことも多い。ここでは１パスで処理を行うとする。さらに、「自由が丘」という単語は認識辞書５０４に含まれないものとする。
【００６８】
まず、入力された音声は入力部５０１によってディジタル信号化される。ディジタル信号化された音声は短時間スペクトル分析の手法でフレーム単位に分割され、各フレームでの音声は音響分析部５０２によって特徴ベクトルの時系列に変換され、認識部５０３に出力される。
【００６９】
認識部５０３では、各フレームにおいて図６のフローチャートに従って処理が行われる。
【００７０】
まずステップ６０１で、音響モデル５０７を用いて尤度Ｐ１の計算を行う。
つぎにステップ６０２で、音韻連鎖モデル５０６と音響モデル５０７を用いて尤度Ｐ２の計算を行う。
ここでは、「自由が丘」という単語は認識辞書５０４に含まれないため、尤度Ｐ１が最大となるのは認識辞書５０４に含まれる別の単語、例えば「自営業」などとなる。「自由が丘」と「自営業」では音素の並びが異なるために、「自営業」の尤度Ｐ１の値は認識辞書５０４に含まれる単語の中では高いものの，認識語彙による音素の並びの制約をうけない音韻連鎖モデル５０６を用いて計算した尤度Ｐ２よりは低くなることが多い。
【００７１】
次に、ステップ６０３において無音部分かどうかの判定を行い、無音だった場合、そのまま次のフレームの処理へ進み、無音でなかった場合はステップ６０４に進む。
【００７２】
次に、ステップ６０４において音韻連鎖モデル重み変更部に重みの変更通知と、尤度Ｐ１および尤度Ｐ２の値を出力する。これは、適切な音韻連鎖モデル重みは、騒音や残響などの周囲の環境によって大きく変わるため、それまでの分析で得られた値をもとに適切な音韻連鎖モデル重みを設定するためである。すべてのフレームでの処理が終了したら、音韻連鎖モデル重みを発声前の値に戻す。
【００７３】
未知語区間検出部５１２では、認識部５０３の出力である「自営業」という候補と、「自営業」の累積尤度の差Ｓのそれぞれを入力とする。ここでは、Ｓが０より大きい単語を未知語と判定する。「自由が丘」と「自営業」の音素の並びの違いから、「自営業」では尤度Ｐ１と尤度Ｐ２の差が大きいので、「自営業」が未知語と判定できる。音韻連鎖モデル重みを変更しなかった場合は、発声環境により適切な音韻連鎖モデル重みの設定が困難なため「自営業」を未知語と判定できないことがある。未知語区間検出部５１２では、「自営業」の部分を「未知語」と変換して出力部５１３に出力し、出力部５１３はその結果をディスプレイなどの表示装置に出力する。
【００７４】
【発明の効果】
以上のように、本発明によれば、音声認識を行う際に、精度よく未知語の検出を行うことができる。また、未知語以外の部分の認識にも頑健な未知語処理を実現することができる。
【図面の簡単な説明】
【図１】本発明による音声認識装置の一例を示すブロック図である。
【図２】本発明による認識部の動作を示すフローチャートである。
【図３】従来の音声認識装置の一例を示すブロック図である。
【図４】従来の音声認識装置における認識部の動作を表すフローチャートである。
【図５】本発明による音声認識方法を用いて単語認識を行う音声認識装置の一例を示すブロック図である。
【図６】本発明による認識部の単語認識の際の動作を示すフローチャートである。
【符号の説明】
１００、３００，５００音声認識装置
１０１、３０１、５０１入力部
１０２、３０２、５０２音響分析部
１０３、３０３、５０３認識部
１０４、３０４、５０４認識辞書
１０５、３０５言語モデル
１０６、３０６、５０６音韻連鎖モデル
１０７、３０７、５０７音響モデル
１０８言語モデル重み変更部
１０９言語モデル重み記憶部
１１０、５１０音韻連鎖モデル重み変更部
１１１、３０８、５１１音韻連鎖モデル重み記憶部
１１２、３０９、５１２未知語区間検出部
１１３、３１０、５１３出力部

[0001]
[Technical field to which the invention belongs]
The present invention relates to a speech recognition apparatus for recognizing uttered speech.
[0002]
[Prior art]
Speech recognition technology has become widely used in recent years. With the development of computer technology in recent years, not only word speech recognition but also continuous speech recognition can be used with software on a PC.
[0003]
In speech recognition, a user may utter a word that is not stored in the recognition dictionary (an unknown word). One method of handling unknown words in continuous speech recognition uses a phoneme typewriter, as described in “Handling of Unknown Words in Continuous Speech Recognition” (IEICE Technical Report, SP91-96, Dec. 1991). A phoneme typewriter models the characteristics of speech in subword units so that every utterance possible in Japanese can be recognized; phonemes are commonly used as the subwords.
[0004]
Hereinafter, the conventional unknown word processing using the phoneme typewriter will be described. FIG. 3 is a block diagram showing an example of conventional unknown word processing using a phoneme typewriter.
[0005]
The speaker's voice input to the speech recognition device 300 is received by the input unit 301 and converted into a digital signal. The digitized speech waveform is input to the acoustic analysis unit 302 and analyzed. The analysis often uses short-time spectrum analysis, in which a relatively short time window of 20 to 40 msec is applied to the waveform and shifted every 8 to 16 msec. The waveform cut out by each window is converted into a time series of feature vectors, one per unit called a frame, which has the length of the window. A feature vector captures the features of the speech spectrum at that time; it usually has 10 to 50 dimensions, and mel-frequency cepstral coefficients are widely used. The resulting feature vectors are output to the recognition unit 303.
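The framing step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `frame_signal`, the 25 msec window, the 10 msec shift, and the choice of a Hamming window are all assumptions picked from within the ranges the text gives. Feature extraction (for example, mel-frequency cepstral coefficients) would follow on each frame and is not shown.

```python
import math

def frame_signal(samples, sample_rate, window_ms=25.0, shift_ms=10.0):
    """Split a waveform into overlapping, Hamming-windowed frames.

    window_ms and shift_ms are illustrative values inside the 20-40 msec
    and 8-16 msec ranges mentioned in the text.
    """
    win = int(sample_rate * window_ms / 1000)    # samples per window
    shift = int(sample_rate * shift_ms / 1000)   # samples per shift
    if win <= 1 or shift <= 0:
        raise ValueError("window and shift must span at least one sample")
    # Hamming window coefficients (an assumed, commonly used window shape).
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win - 1))
               for n in range(win)]
    frames = []
    for start in range(0, len(samples) - win + 1, shift):
        chunk = samples[start:start + win]
        frames.append([s * w for s, w in zip(chunk, hamming)])
    return frames
```

With a 16 kHz signal these defaults give 400-sample frames that advance by 160 samples.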
[0006]
For the acoustic model 307, an HMM (hidden Markov model) prepared for each recognition unit is widely used, and a phoneme piece is often used as the recognition unit. An HMM is a nondeterministic probabilistic finite automaton with a plurality of states, and it serves as a statistical signal source model that represents a non-stationary signal source as a concatenation of stationary signal sources. An HMM is expressed as a set of states between which transitions occur, and consists of transition probabilities, which give the probability of each state transition, and output probabilities, which give the probability of an observation vector when a transition occurs. The HMM-based acoustic model used for speech recognition models the properties of phonemes. In this HMM, a state corresponds roughly to a phonological event (a stable interval such as a closure, burst, frication, or steady vowel). The output probability is the probability of the fluctuation of the signal that is output with a transition. For each word in the vocabulary of the recognition dictionary, the probability that the sequence formed by the outputs of the states corresponding to its recognition units (phoneme pieces, etc.), hereinafter called the HMM output sequence, matches the input signal sequence (usually a time series of feature vectors) is obtained from the output probabilities and transition probabilities, and speech recognition is achieved by taking the word with the maximum value as the recognition result. Here the input signal sequence is available as observed data of an event that has already occurred, and what is computed is how plausible the HMM output sequence is given that data. Because the aim is to judge the correctness of a hypothesis (the HMM output sequence) that explains an event that has already occurred (the input signal sequence), the analysis uses the concept of likelihood rather than probability. When the likelihood of an HMM state sequence is calculated, the sum of log-likelihoods is often computed instead of the product of likelihoods. The Viterbi algorithm is widely used to find such state sequences. In continuous speech recognition, the language model imposes grammatical constraints on the order of the words in a sentence.
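To illustrate why sums of log-likelihoods replace products, here is a small log-domain Viterbi sketch. It is an assumption-laden toy: it uses a discrete-observation HMM that emits on states (the text describes outputs attached to transitions and continuous feature vectors), and the name `viterbi_log` and the table layout are hypothetical.

```python
import math

def viterbi_log(log_init, log_trans, log_emit, observations):
    """Return (best state path, its log-likelihood) for a discrete HMM.

    log_init[s], log_trans[p][s] and log_emit[s][o] are log-probabilities,
    so path scores are accumulated by addition instead of multiplication.
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[s][observations[0]]
             for s in range(n_states)]
    backptrs = []
    for obs in observations[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at this time step.
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[s][obs])
            ptr.append(best_prev)
        score = new_score
        backptrs.append(ptr)
    # Trace the best final state back to the start.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]
```

Working in the log domain also avoids numerical underflow when many small probabilities are combined over a long utterance.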
[0007]
Parameters such as the output probabilities and transition probabilities are trained in advance on corresponding training speech, using an algorithm such as the Baum-Welch algorithm. In the following, it is assumed that HMMs whose recognition unit is the phoneme are stored in the acoustic model 307.
[0008]
The recognition dictionary 304 stores information on the recognizable vocabulary: the written form (notation) of each word and its phoneme symbol string.
[0009]
For the language model 305, an n-gram model based on the vocabulary in the recognition dictionary 304 is widely used. An n-gram model is a kind of statistical language model that estimates probabilities from sample data by statistical methods; the implementation of a language model using an n-gram model is described in detail in “Speech Recognition System” (Ohmsha).
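As a rough illustration of estimating n-gram probabilities from sample data, the sketch below trains an unsmoothed bigram model (n = 2) by relative frequency. The function name `train_bigram` and the `<s>` / `</s>` boundary tokens are assumptions made for the example; practical language models normally add smoothing or back-off, which is omitted here.

```python
from collections import Counter

def train_bigram(sentences):
    """Return prob(cur, prev) = count(prev, cur) / count(prev).

    sentences is an iterable of word lists. Sentence boundaries are marked
    with the hypothetical tokens "<s>" and "</s>".
    """
    unigram, bigram = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1

    def prob(cur, prev):
        # Unseen histories get probability 0.0 in this unsmoothed sketch.
        if unigram[prev] == 0:
            return 0.0
        return bigram[(prev, cur)] / unigram[prev]

    return prob
```

A word sequence that never appeared in the training data gets probability zero under this estimate, which is one reason an utterance containing an unknown word can derail the words that follow it.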
[0010]
The phonological chain model 306 stores the rules for the phonological chains that can occur in Japanese. In unknown word processing using a phoneme typewriter, phoneme recognition for unknown word detection is performed in parallel with normal continuous speech recognition. The phonological chain model 306 is often stored in the same data structure as the language model 305 so that the implementation of the recognition unit 303 can be shared.
[0011]
The recognition unit 303 performs speech recognition processing using the output of the acoustic analysis unit 302, the acoustic model 307, the language model 305, and the phonological chain model 306. Normal continuous recognition processing can be realized as described in the above “Speech Recognition System” (Ohmsha). Here, however, the search in the recognition process is assumed to be a one-pass frame-synchronous beam search. In each frame, phoneme recognition using the phoneme typewriter is performed in parallel. FIG. 4 is a flowchart showing the processing for each frame in the recognition unit 303. The recognition process for the i-th frame is described below with reference to FIG. 4.
[0012]
First, in step 401, normal recognition processing is performed on the corresponding frame. Specifically, the likelihood of each hypothesis is calculated from the feature vector output by the acoustic analysis unit 302 using the acoustic model 307 and the language model 305, and hypotheses with low likelihood are removed from evaluation (pruning). The likelihood calculation and the pruning are also described in the above “Speech Recognition System” (Ohmsha). In addition, the cumulative likelihood of each hypothesis is stored. For example, the cumulative likelihood corresponding to frame i is held in an array element P1[i], and P1 is stored separately for each hypothesis.
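One frame of such processing can be sketched as below, under stated assumptions: likelihoods are log-likelihoods, pruning keeps every hypothesis within a fixed `beam_width` of the best one, and the per-frame scores are already computed. The function name `advance_frame` and the dictionary layout are hypothetical; the text only says that low-likelihood hypotheses are pruned and that P1 is kept per hypothesis.

```python
def advance_frame(p1, frame_loglik, beam_width):
    """Extend each hypothesis by one frame, then prune.

    p1:           dict hypothesis -> list of cumulative log-likelihoods,
                  one entry per processed frame (the array P1 of the text)
    frame_loglik: dict hypothesis -> log-likelihood gained at this frame
    beam_width:   assumed pruning margin below the best hypothesis
    """
    if not p1:
        return {}
    updated = {}
    for hyp, history in p1.items():
        previous = history[-1] if history else 0.0
        updated[hyp] = history + [previous + frame_loglik[hyp]]
    best = max(h[-1] for h in updated.values())
    # Keep only hypotheses whose cumulative score is close to the best.
    return {hyp: h for hyp, h in updated.items()
            if h[-1] >= best - beam_width}
```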
[0013]
Next, in step 402, the same processing as in step 401 is performed using the phonological chain model 306 instead of the language model 305. Here, however, the maximum likelihood among the hypotheses of step 401 is used as the pruning score threshold. This is because recognition using the language model 305 is subject to linguistic constraints, so the maximum likelihood is higher in recognition using the phonological chain model 306, which admits any phonological chain possible in Japanese; moreover, for unknown word detection it is sufficient that a single hypothesis from the recognition using the phonological chain model survives. At this time, the maximum cumulative likelihood obtained with the phonological chain model in the corresponding frame is stored in the array element P2[i].
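The thresholding in step 402 might look like the following sketch. It assumes log-likelihood scores held in plain dictionaries; the function name `phoneme_pass` is hypothetical, and the fallback that keeps the single best phoneme hypothesis when none clears the threshold is an added assumption, motivated by the remark that one surviving hypothesis is enough.

```python
def phoneme_pass(word_scores, phoneme_scores):
    """Prune phoneme-typewriter hypotheses against the word-level best.

    word_scores:    dict hypothesis -> cumulative log-likelihood (step 401)
    phoneme_scores: dict hypothesis -> cumulative log-likelihood (step 402)
    Returns (surviving phoneme hypotheses, P2[i]).
    """
    threshold = max(word_scores.values())      # best score from step 401
    p2_i = max(phoneme_scores.values())        # value recorded as P2[i]
    survivors = {h: s for h, s in phoneme_scores.items() if s >= threshold}
    if not survivors:
        # Assumption: always keep the best phoneme hypothesis alive.
        best = max(phoneme_scores, key=phoneme_scores.get)
        survivors = {best: phoneme_scores[best]}
    return survivors, p2_i
```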
[0014]
When the above processing has been completed for all frames, the following is done for each word of the sentence corresponding to the hypothesis with the highest score. At the last frame of the word (here the j-th frame), the maximum cumulative likelihood from the analysis using the phonological chain model is multiplied by the value stored in the phonological chain model weight storage unit, and the difference between this product and the cumulative likelihood from the analysis using the language model 305 is stored as S for that word. That is, if the value stored in the phonological chain model weight storage unit is α, then S = α × P2[j] − P1[j]. This value is obtained for each word. Finally, the notation of each word of the sentence corresponding to the hypothesis with the highest score and the value of S are output to the unknown word section detection unit 309.
[0015]
The unknown word section detection unit 309 receives the output of the recognition unit 303 as its input. Each input word whose S is greater than 0 is treated as an unknown word, and a character string in which the notation of that word is replaced with “unknown word” is output to the output unit 310. Here, the phonological chain model weight storage unit 308 stores a phonological chain model weight that has been determined in advance with the utterance environment, such as noise, taken into account.
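The per-word score S = α × P2[j] − P1[j] and the S > 0 test can be sketched together as follows. The function name `mark_unknown_words`, the tuple layout, and the numeric values in the usage note are illustrative assumptions; only the formula and the threshold come from the text.

```python
def mark_unknown_words(words, alpha):
    """Flag words whose score S = alpha * P2[j] - P1[j] is positive.

    words: list of (notation, p1_j, p2_j), where p1_j and p2_j are the
           cumulative likelihoods at the word's last frame j
    alpha: the phonological chain model weight
    Returns a list of (output notation, S).
    """
    result = []
    for notation, p1_j, p2_j in words:
        s = alpha * p2_j - p1_j
        # A positive S means the weighted phoneme-typewriter score beats
        # the language-model score, so the word is reported as unknown.
        result.append(("unknown word" if s > 0 else notation, s))
    return result
```

For instance, with `alpha = 1.0`, a word with `p1_j = -120.0` and `p2_j = -100.0` yields S = 20.0 and is replaced by "unknown word", whereas a word with `p1_j = -150.0` and `p2_j = -151.0` yields S = -1.0 and keeps its notation.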
[0016]
The output unit 310 is a device that can output a character string, such as a display, and outputs the character string output from the unknown word section detection unit.
[0017]
Because the appropriate phonological chain model weight varies greatly with the speaker and the utterance environment, the weight must be changed according to these conditions in order to detect unknown words accurately. As a way of detecting unknown words independently of the environment, the method of Japanese Patent Laid-Open No. 4-255900 obtains an appropriate phonological chain model weight from the likelihoods of the analysis by the phoneme typewriter and of the analysis with language constraints.
[0018]
[Problems to be solved by the invention]
However, the method described above as the prior art has the problem that the appropriate phonological chain model weight differs depending on the environment. The method of Japanese Patent Laid-Open No. 4-255900, which addresses this, has the drawback that the appropriate weight is not known until the utterance has been analyzed to its end, so no information about whether a word is unknown is available during, for example, a frame-synchronous search. Moreover, in speech recognition using a statistical language model such as an n-gram, the recognition result following an uttered unknown word is not very reliable: even if only the portion that differs greatly from the phoneme typewriter result is recognized as an unknown word, the correct words are often not selected for the subsequent portion, so recognition of the parts other than the unknown word is a problem.
[0019]
Accordingly, an object of the present invention is to provide a speech recognition apparatus that detects unknown words with high accuracy and realizes unknown word processing that is also robust in recognizing the parts other than unknown words.
[0020]
[Means for Solving the Problems]
The present invention is a speech recognition apparatus having a recognition dictionary that stores words, a language model trained in advance on the words stored in the recognition dictionary, and a phonological chain model that stores the rules of the phonological chains possible in the recognition target language, the apparatus being characterized by comprising: a recognition unit that performs an analysis using the language model and an analysis using the phonological chain model; a phonological chain model weight storage unit that stores a value for weighting the result of the analysis using the phonological chain model; and a phonological chain model weight changing unit that changes the value stored in the phonological chain model weight storage unit using the results of the analysis using the language model and the analysis using the phonological chain model.
[0021]
Further, the present invention is characterized by comprising a language model weight storage unit that stores the weight of the language model, and a language model weight changing unit that changes the weight stored in the language model weight storage unit based on the results of the analysis using the language model and the analysis using the phonological chain model.
[0022]
Further, the present invention is characterized in that the language model weight changing unit changes the weight of the language model to a value obtained by a function of the difference between the likelihoods obtained in the recognition unit by the analysis using the language model and the analysis using the phonological chain model, or of the value stored in the language model weight storage unit.
[0023]
Further, the present invention is characterized in that the phonological chain model weight storage unit changes the phonological chain model weight to a value obtained by a function of the difference between the likelihoods obtained in the recognition unit by the analysis using the language model and the analysis using the phonological chain model, or of the value stored in the phonological chain model weight storage unit.
[0024]
Further, the present invention is characterized in that the recognition unit performs a multi-pass search when calculating likelihoods, and in the second and subsequent passes the language model weight changing unit maintains the value stored in the language model weight storage unit.
[0025]
Further, the present invention is characterized in that the recognition unit performs a multi-pass search when calculating likelihoods, and in the second and subsequent passes the phonological chain model weight changing unit maintains the value stored in the phonological chain model weight storage unit.
[0026]
Further, the present invention is characterized in that the language model weight storage unit separately stores the value from before the utterance, and the language model weight changing unit changes the value stored in the language model weight storage unit back to the value from before the utterance when the utterance is determined to have ended.
[0027]
Further, the present invention is characterized in that the phonological chain model weight storage unit separately stores the value from before the utterance, and the phonological chain model weight changing unit changes the value stored in the phonological chain model weight storage unit back to the value from before the utterance when the utterance is determined to have ended.
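The save-and-restore behaviour described for both weight storage units can be sketched as a small class. The class name `WeightStore` and its method names are hypothetical, and how the end of an utterance is detected is left outside the sketch.

```python
class WeightStore:
    """Holds a model weight plus a copy of its value from before the utterance."""

    def __init__(self, initial=1.0):
        self.value = initial     # weight currently used by the recognizer
        self._saved = initial    # value from before the current utterance

    def begin_utterance(self):
        # Remember the pre-utterance value separately.
        self._saved = self.value

    def update(self, new_value):
        # Called by the weight changing unit during the utterance.
        self.value = new_value

    def end_utterance(self):
        # Restore the pre-utterance value once the utterance has ended.
        self.value = self._saved
```

A device used in changing acoustic environments would call `end_utterance()` after every utterance, so that a weight adapted to one environment does not carry over into the next.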
[0028]
Further, the present invention is characterized in that the language model weight changing unit maintains a value stored in the language model weight storage unit in a processing unit determined to be a silent part.
[0029]
Further, the present invention is characterized in that the phonological chain model weight changing unit maintains a value stored in the phonological chain model weight storage unit in a processing unit determined to be a silent part.
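Holding the weight during silence can be sketched as below. The update applied on non-silent frames, α = (P2[i] − P1[i]) / i, is the one given in paragraph [0055] of the description; the function name `track_weight`, the 1-based frame index, the tuple layout, and the silence flag supplied by the caller are assumptions for illustration.

```python
def track_weight(frames, alpha=1.0):
    """Update the phonological chain model weight frame by frame.

    frames: list of (p1_i, p2_i, is_silent) per frame, in order
    alpha:  weight before the first frame
    Returns the weight in effect after each frame.
    """
    history = []
    for i, (p1_i, p2_i, is_silent) in enumerate(frames, start=1):
        if not is_silent:
            # Update rule from paragraph [0055]; i counts frames from 1.
            alpha = (p2_i - p1_i) / i
        # In a silent frame the stored value is simply maintained.
        history.append(alpha)
    return history
```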
[0030]
Further, the present invention is a speech recognition method using a recognition dictionary that stores words, a language model trained in advance on each word stored in the recognition dictionary, and a phonological chain model that stores the rules of the phonological chains possible in the recognition target language, the method being characterized by comprising: recognition means that, for each processing unit obtained by division in time, performs an analysis using the language model and an analysis using the phonological chain model and outputs the maximum likelihood value of each analysis; phonological chain model weight storage means that stores a value for weighting the analysis using the phonological chain model; and phonological chain model weight changing means that, for each processing unit, changes the value stored by the phonological chain model weight storage means using the maximum likelihood values of the analysis using the language model and of the analysis using the phonological chain model.
[0031]
Further, the present invention is characterized by comprising language model weight storage means that stores the weight of the language model, and language model weight changing means that, for each processing unit, changes the weight stored by the language model weight storage means using the maximum likelihood values of the analysis using the language model and of the analysis using the phonological chain model.
[0032]
Further, the present invention is a speech recognition method having a recognition dictionary that stores words, a language model trained in advance on each word stored in the recognition dictionary, and a phonological chain model that stores the rules of the phonological chains possible in the recognition target language, the method being characterized by comprising: recognition means that performs an analysis using the language model and an analysis using the phonological chain model; phonological chain model weight storage means that stores a value for weighting the result of the analysis using the phonological chain model; and phonological chain model weight changing means that changes the value stored by the phonological chain model weight storage means using the results of the analysis using the language model and the analysis using the phonological chain model.
[0033]
Further, the present invention is characterized by comprising language model weight storage means that stores the weight of the language model, and language model weight changing means that changes the weight stored by the language model weight storage means based on the results of the analysis using the language model and the analysis using the phonological chain model.
[0034]
Further, the present invention provides a speech recognition program that uses a recognition dictionary storing word information, a language model trained in advance on each word stored in the recognition dictionary, and a phonological chain model storing the rules of the phonological chains possible in the recognition target language, the program causing a computer to function as: recognition means that performs an analysis using the language model and an analysis using the phonological chain model; phonological chain model weight storage means that stores a value for weighting the result of the analysis using the phonological chain model; and phonological chain model weight changing means that changes the value stored by the phonological chain model weight storage means using the results of the analysis using the language model and the analysis using the phonological chain model.
[0035]
The present invention also provides the above speech recognition program, which further causes the computer to function as: language model weight storage means for storing the weight of the language model; and language model weight changing means for changing the weight stored by the language model weight storage means on the basis of the results of the analysis using the language model and the analysis using the phonological chain model.
[0036]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, the present invention will be described in detail based on examples.
[0037]
FIG. 1 is a block diagram showing an example of a speech recognition apparatus according to the present invention. The speech recognition apparatus 100 includes an input unit 101, an acoustic analysis unit 102, a recognition unit 103, an acoustic model 107, a recognition dictionary 104, a language model 105, a phonological chain model 106, a phonological chain model weight changing unit 110, a phonological chain model weight storage unit 111, a language model weight change unit 108, a language model weight storage unit 109, an unknown word section detection unit 112, and an output unit 113.
[0038]
The input unit 101 converts the input voice into a digital signal.
[0039]
The acoustic analysis unit 102 converts the input digital signal into a feature vector time series.
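As a rough illustration of how a digitized signal is cut into frames before feature vectors are computed, the following Python sketch slices a sample sequence into overlapping frames. The function name `split_into_frames` and the 25 ms window and 10 ms shift are assumptions, chosen from within the ranges given for the conventional analysis (windows of 20 to 40 msec shifted every 8 to 16 msec). The feature extraction itself (for example, mel-frequency cepstral coefficients) is omitted.

```python
def split_into_frames(samples, sample_rate, window_ms=25, shift_ms=10):
    """Slice a sequence of samples into overlapping fixed-length frames.

    window_ms and shift_ms are illustrative values, not values taken
    from the embodiment.
    """
    window = int(sample_rate * window_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    if window <= 0 or shift <= 0:
        raise ValueError("window and shift must cover at least one sample")
    frames = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += shift
    return frames


# One second of dummy samples at 16 kHz (hypothetical input).
frames = split_into_frames(list(range(16000)), sample_rate=16000)
print(len(frames), len(frames[0]))
```

Each returned frame would then be converted into one feature vector, giving the feature vector time series passed to the recognition unit 103.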
[0040]
The acoustic model 107 stores phonological features of utterances using an HMM prepared for each phoneme unit. In order to support the two-pass search in the recognition unit 103, two models are stored: a model with lower accuracy that allows faster recognition, and a model with higher accuracy whose recognition is slower.
[0041]
The recognition dictionary 104 stores a character string representing a word notation and a character string representing a phoneme symbol string as recognizable vocabulary information.
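The two character strings held for each vocabulary item can be pictured with a small record type. The following Python sketch is only an illustration: the class name `DictionaryEntry`, the helper `is_in_dictionary`, and the entries themselves (including the romanized phoneme strings) are assumptions introduced here, not data from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    notation: str   # character string representing the word notation
    phonemes: str   # character string representing the phoneme symbol string


# Hypothetical entries; the romanized phoneme strings are illustrative guesses.
recognition_dictionary = [
    DictionaryEntry(notation="self-employed", phonemes="j i e i g y o u"),
    DictionaryEntry(notation="go", phonemes="i k u"),
]


def is_in_dictionary(notation, dictionary):
    """Return True if a word with this notation is registered."""
    return any(entry.notation == notation for entry in dictionary)


print(is_in_dictionary("self-employed", recognition_dictionary))
print(is_in_dictionary("Jiyugaoka", recognition_dictionary))
```

A word for which the lookup fails corresponds to an unknown word in the sense used in this description.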
[0042]
The language model 105 stores statistical language information based on the n-gram model.
[0043]
The phoneme chain model 106 stores phoneme chain rules that may be in Japanese with the same data structure as the language model.
[0044]
The phoneme chain model weight storage unit 111 stores phoneme chain model weights used for recognition in the recognition unit, and 1 is stored as an initial value. How to use the phoneme chain model weight will be described later.
[0045]
The language model weight storage unit 109 stores the weight of the language model used in the recognition process in the recognition unit, and 1 is stored as an initial value.
[0046]
The recognition unit 103 performs speech recognition processing using the output of the acoustic analysis unit 102 and the information in the acoustic model 107, the language model 105, and the phonological chain model 106. Here, it is assumed that the recognition unit 103 performs recognition processing in two passes, with a frame-synchronous beam search in both passes. In the first pass, candidates are narrowed down quickly using a simple acoustic model, and in the second pass, high-accuracy recognition is performed using a high-accuracy acoustic model. FIG. 2 is a flowchart showing the processing of the i-th frame in the first pass in the recognition unit. Hereinafter, the recognition process in the i-th frame will be described with reference to FIG. 2.
[0047]
In step 201, normal recognition processing is performed in the corresponding frame using the acoustic model 107 and the language model 105, in the same manner as the method described in the related art. Here, the cumulative likelihood corresponding to frame i is stored as P1[i], and the likelihood P1 is stored for each hypothesis.
[0048]
Next, in step 202, the same processing as in step 201 is performed using the phonological chain model 106 instead of the language model 105. At this time, the maximum likelihood among the hypotheses in step 201 is used as the score threshold for pruning. Similarly, the maximum cumulative likelihood in the corresponding frame when the phoneme chain model is used is stored as P2[i].
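One possible reading of the pruning in step 202 is sketched below in Python. Several points are assumptions rather than statements of the embodiment: the function name `phoneme_chain_step`, the direction of pruning (phoneme chain hypotheses scoring below the best step 201 score are discarded), the fallback used when no hypothesis survives, and all labels and scores.

```python
def phoneme_chain_step(p1_hypotheses, p2_hypotheses):
    """Prune phoneme-chain hypotheses against the best step-201 score.

    Both arguments map a hypothesis label to its cumulative likelihood
    in the current frame. Returns (surviving hypotheses, P2[i]).
    """
    # Best score obtained with the language model in step 201.
    threshold = max(p1_hypotheses.values())
    # Assumption: hypotheses scoring below the threshold are pruned.
    survivors = {h: s for h, s in p2_hypotheses.items() if s >= threshold}
    # Assumption: if nothing survives, fall back to the threshold itself.
    p2_i = max(survivors.values()) if survivors else threshold
    return survivors, p2_i


# Hypothetical scores for one frame.
p1 = {"word_a": 80.0, "word_b": 65.0}
p2 = {"phoneme_seq_a": 100.0, "phoneme_seq_b": 72.0}
survivors, p2_i = phoneme_chain_step(p1, p2)
print(survivors, p2_i)
```

Under this reading, only phoneme sequences that score at least as well as the best dictionary-constrained hypothesis remain in the beam, which keeps the second analysis inexpensive.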
[0049]
In step 203, it is determined whether the frame is a silent part. The frame is determined to be silent when, in the acoustic model 107, the HMM state corresponding to silence has the highest likelihood. If the frame is silent, the process proceeds to the next frame; if not, the process proceeds to step 204. This is because, in a silent part, recognition using the language model shows no difference in likelihood, so the expected effect cannot be obtained even if the value of the language model weight or the phoneme chain model weight is changed. In step 204, the change notification, the value of P2[i], and the value of P1[i] are output to the phoneme chain model weight change unit.
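The silence test can be sketched in a few lines of Python. The function name `is_silent_frame`, the state labels, and the scores are hypothetical; the only rule taken from the description is that a frame counts as silent when a silence state of the HMM has the highest likelihood.

```python
def is_silent_frame(state_likelihoods, silence_states):
    """Return True when the best-scoring HMM state is a silence state.

    state_likelihoods maps a state label to its likelihood in this frame.
    """
    best_state = max(state_likelihoods, key=state_likelihoods.get)
    return best_state in silence_states


# Hypothetical per-state likelihoods for two frames.
print(is_silent_frame({"sil": 0.9, "a_1": 0.3, "k_2": 0.1}, {"sil"}))
print(is_silent_frame({"sil": 0.2, "a_1": 0.7, "k_2": 0.1}, {"sil"}))
```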
[0050]
In step 205, it is determined whether to send a change notification to the language model weight changing unit. Here, letting α be the value stored in the phoneme chain model weight storage unit, when α × P2[i] − P1[i] is a positive value, the value of P2[i] and the value of P1[i] are output in step 206, together with the change notification, to the language model weight change unit and the phonological chain model weight storage unit. Otherwise, the process proceeds to the next frame.
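The branching in steps 203 to 206 can be sketched as follows. This is a minimal Python illustration, not the embodiment itself: the function name `process_frame`, its return convention, and the numeric scores are assumptions introduced here.

```python
def process_frame(p1_i, p2_i, alpha, silent):
    """Decide which change notifications one first-pass frame produces.

    Returns (notify_phoneme_chain_weight, notify_language_model_weight).
    """
    if silent:
        # Step 203: a silent frame triggers no weight change.
        return False, False
    # Step 204: every non-silent frame notifies the phoneme chain
    # model weight change unit.
    notify_phoneme_chain = True
    # Steps 205 and 206: notify the language model weight change unit
    # only when the weighted phoneme-chain score exceeds P1[i].
    notify_language_model = alpha * p2_i - p1_i > 0
    return notify_phoneme_chain, notify_language_model


# Hypothetical scores for frame i.
print(process_frame(p1_i=80.0, p2_i=100.0, alpha=1.0, silent=False))
print(process_frame(p1_i=80.0, p2_i=100.0, alpha=0.7, silent=False))
print(process_frame(p1_i=80.0, p2_i=100.0, alpha=1.0, silent=True))
```

With these made-up numbers, lowering α from 1.0 to 0.7 is enough to suppress the language model notification, which shows how the weight acts as a penalty on the phoneme chain score.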
[0051]
In the second pass, almost the same processing is performed except that a high-accuracy acoustic model is used. However, in the second pass, no change notification is sent to the language model weight change unit or the phonological chain model weight change unit, so both units maintain the previous values. This is because the analysis using the phonological chain model has already been completed for all input frames in the first pass, so the information from the entire utterance is already available.
[0052]
Further, the language model weight and the phonological chain model weight changed by the current utterance do not necessarily have to be returned to their values before the utterance. When the invention is applied to a device such as a mobile phone, for which the environment during utterance (noise and the like) is not fixed, returning the weights to their pre-utterance values prevents the stored values from having an adverse effect even if the environment has changed significantly by the next utterance. On the other hand, when it is applied to a device such as a home computer, for which the environment at the time of utterance is almost constant, keeping the values changed by the current utterance, without returning to the pre-utterance state, allows the next utterance to be recognized with appropriate values.
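The choice between restoring and keeping a weight at the end of an utterance could be organized as in the Python sketch below. The class `WeightStore` and its method names are hypothetical; the sketch only assumes that the pre-utterance value is stored separately, as the claims describe for the weight storage units.

```python
class WeightStore:
    """Holds a weight together with a separately stored pre-utterance value."""

    def __init__(self, initial=1.0):
        self.value = initial
        self._before_utterance = initial

    def begin_utterance(self):
        # Remember the value in effect before this utterance starts.
        self._before_utterance = self.value

    def end_utterance(self, restore):
        # restore=True suits a variable environment (e.g. a mobile phone);
        # restore=False suits a nearly constant one (e.g. a home computer).
        if restore:
            self.value = self._before_utterance


mobile = WeightStore()
mobile.begin_utterance()
mobile.value = 0.6          # changed during the utterance (made-up value)
mobile.end_utterance(restore=True)

desktop = WeightStore()
desktop.begin_utterance()
desktop.value = 0.6
desktop.end_utterance(restore=False)
print(mobile.value, desktop.value)
```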
[0053]
When the above processing is completed for all frames and the second pass processing is also completed, the hypothesis having the maximum likelihood and the cumulative likelihood difference S for each word of that hypothesis are output to the unknown word section detection unit 112.
[0054]
The phoneme chain model weight changing unit 110 receives the notification from the recognition unit 103 and changes the value stored in the phoneme chain model weight storage unit. Here, the analysis using the phoneme chain model is not subject to the linguistic restrictions of the recognition dictionary or the language model, so the likelihood it obtains tends to be high. Consequently, when the likelihoods of the two analyses are compared, the likelihood obtained by the analysis using the phonological chain model is large even if the utterance is not an unknown word. For this reason, the coefficient indicating the weight (penalty) is always set to a value smaller than 1.
[0055]
The change is made as α = (P2[i] − P1[i]) / i, where α is the value stored in the phoneme chain model weight storage unit. This is because a mismatch between the characteristics of the acoustic model 107 and the uttered speech, arising from differences in environment, often enlarges the likelihood difference between P2[i] and P1[i].
[0056]
The language model weight change unit 108 changes the value stored in the language model weight storage unit in accordance with the notification from the recognition unit 103. The change is β = β × ((P2[i] − P1[i]) / P2[i]), where β is the value stored in the language model weight storage unit. This is because the larger the difference between the values of P2[i] and P1[i], the less reliable the subsequent scores of the language model become.
[0057]
The unknown word section detection unit 112 receives the output of the recognition unit 103 as an input, determines that each input word whose cumulative likelihood difference S is greater than 0 is an unknown word, converts the notation of that word to the character string “unknown word”, and outputs the resulting character string to the output unit 113.
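The decision rule of the unknown word section detection unit can be sketched in Python as follows. The function name `mark_unknown_words`, the input format (pairs of word and S), and the S values are assumptions made for illustration. The difference S is taken as given, since its computation belongs to the recognition unit.

```python
def mark_unknown_words(words_with_s):
    """Replace each word whose cumulative likelihood difference S is > 0.

    words_with_s is a list of (word notation, S) pairs in utterance order.
    """
    return ["unknown word" if s > 0 else word for word, s in words_with_s]


# Hypothetical S values for a three-word hypothesis.
hypothesis = [("self-employed", 12.5), ("ni", -3.0), ("go", -1.2)]
print(mark_unknown_words(hypothesis))
```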
[0058]
The output unit 113 uses a device that can output a character string, such as a display, and outputs the character string output from the recognition unit 103.
(Example 1)
Hereinafter, a specific processing operation when the sentence “go to Jiyugaoka” is uttered will be shown. At this time, the word “Jiyugaoka” is not included in the recognition dictionary 104.
[0059]
First, the input voice is converted into a digital signal by the input unit 101. The voice converted into a digital signal is divided into frames by a short-time spectrum analysis method, and the voice in each frame is converted into a time series of feature vectors by the acoustic analysis unit 102 and output to the recognition unit 103.
[0060]
In the recognition unit 103, processing is performed in each frame according to the flowchart of FIG. 2.
[0061]
First, in step 201, the likelihood P1 is calculated using the language model 105 and the acoustic model 107.
[0062]
In step 202, the likelihood P2 is calculated using the phoneme chain model 106 and the acoustic model 107.
Here, since the word “Jiyugaoka” is not included in the recognition dictionary 104, the word with the maximum likelihood P1 is another word included in the recognition dictionary 104, such as “self-employed”. Since “Jiyugaoka” and “self-employed” have different phoneme arrangements, the likelihood P1 of “self-employed” is high among the words included in the recognition dictionary 104, but it is considered to be lower than the likelihood P2 calculated using the phoneme chain model 106, which is not subject to the restriction of phoneme arrangement imposed by the recognition vocabulary.
[0063]
Next, in step 203, it is determined whether or not there is a silent part. If there is no sound, the process proceeds to the next frame as it is. If not, the process proceeds to step 204.
[0064]
In step 204, a weight change notification and the values of likelihood P1 and likelihood P2 are output to the phoneme chain model weight change unit. This is because the appropriate phonological chain model weight varies greatly depending on the surrounding environment such as noise and reverberation, and therefore the appropriate phonological chain model weight is set based on the values obtained in the previous analysis.
[0065]
Next, in step 205, it is determined whether to change the weight of the language model. If α × P2[i] − P1[i] is a positive value, the weight of the language model is changed in step 206. Otherwise, the process proceeds to the next frame without doing anything. In the second pass, almost the same processing is performed except that a high-accuracy acoustic model is used. In the second pass, however, no change notification is sent to the language model weight change unit or the phonological chain model weight change unit, so both units maintain the previous values. When the processing is completed for all the frames, the language model weight and the phonological chain model weight are returned to their values before the utterance.
[0066]
Since the language model 105 stores statistical linguistic information based on the n-gram model, a word sequence that is unnatural in Japanese, such as “go to self-employment”, obtains a low likelihood from the language model 105 even if the likelihood obtained using the acoustic model 107 is high. As a result, the likelihood of a word sequence such as “self-employed” or “operating self-employed” becomes higher, and a recognition error occurs in the “go to” portion of the utterance. However, by changing the weight of the language model in step 206, the linguistic restrictions are relaxed, and a word sequence with high acoustic likelihood, such as “go to”, can follow “self-employed”. When the above processing is completed for all frames and the second pass processing is also completed, the hypothesis with the maximum likelihood (in this example, “go to self-employment”) and the cumulative likelihood difference S for each word of the hypothesis are output to the unknown word section detection unit 112. When the weight of the language model is not changed, the hypothesis with the maximum likelihood is often a word string that is correct as Japanese, such as “self-employed”.
[0067]
The unknown word section detection unit 112 receives as input the output of the recognition unit 103, namely the hypothesis “go to self-employment” and the likelihood difference S for each of the words “self-employment”, “ni”, and “go”. Here, a word whose cumulative likelihood difference S is larger than 0 is determined to be an unknown word. Because of the difference in phoneme arrangement between “Jiyugaoka” and “self-employed”, the difference between the likelihood P1 and the likelihood P2 is large for “self-employed”, so “self-employed” can be determined to be an unknown word. If the phonological chain model weight is not changed, it is difficult to set a weight appropriate to the utterance environment, so “self-employed” may fail to be determined as an unknown word, or “ni” or “go” may be judged to be an unknown word. The unknown word section detection unit 112 converts the “self-employed” part to “unknown word” and outputs it to the output unit 113, and the output unit 113 outputs the result to a display device such as a display.
(Example 2)
Hereinafter, as an example of the case where the speech recognition apparatus using the speech recognition method of the present invention is used as a word speech recognition apparatus, the operation when the word “Jiyugaoka” is uttered is shown. FIG. 5 shows an example of a speech recognition apparatus that performs word recognition using the speech recognition method of the present invention. In recognition processing for words only, statistical language information based on n-grams is not used. Also, word recognition often does not use the multi-pass search that is common in the sentence recognition described above. Here, it is assumed that processing is performed in one pass. Furthermore, the word “Jiyugaoka” is not included in the recognition dictionary 504.
[0068]
First, the input voice is converted into a digital signal by the input unit 501. The voice converted into a digital signal is divided into frames by a short-time spectrum analysis method, and the voice in each frame is converted into a time series of feature vectors by the acoustic analysis unit 502 and output to the recognition unit 503.
[0069]
In the recognition unit 503, processing is performed in each frame according to the flowchart of FIG. 6.
[0070]
First, in step 601, the likelihood P1 is calculated using the acoustic model 507.
Next, in step 602, likelihood P2 is calculated using the phoneme chain model 506 and the acoustic model 507.
Here, since the word “Jiyugaoka” is not included in the recognition dictionary 504, the word with the maximum likelihood P1 is another word included in the recognition dictionary 504, such as “self-employed”. Since “Jiyugaoka” and “self-employed” have different phoneme arrangements, the likelihood P1 of “self-employed” is high among the words included in the recognition dictionary 504, but it is often lower than the likelihood P2 calculated using the phoneme chain model 506, which is not subject to the restriction of phoneme arrangement imposed by the recognition vocabulary.
[0071]
Next, in step 603, it is determined whether or not there is a silent part. If there is no sound, the process proceeds to the next frame as it is. If not, the process proceeds to step 604.
[0072]
Next, in step 604, a weight change notification and likelihood P1 and likelihood P2 values are output to the phonological chain model weight changing unit. This is because the appropriate phonological chain model weight varies greatly depending on the surrounding environment such as noise and reverberation, and therefore the appropriate phonological chain model weight is set based on the values obtained in the previous analysis. When the processing is completed for all frames, the phonological chain model weight is returned to the value before utterance.
[0073]
The unknown word section detection unit 512 receives as input the candidate “self-employed” output from the recognition unit 503 and the cumulative likelihood difference S of “self-employed”. Here, a word with S greater than 0 is determined to be an unknown word. Because of the difference in phoneme arrangement between “Jiyugaoka” and “self-employed”, the difference between the likelihood P1 and the likelihood P2 is large for “self-employed”, so “self-employed” can be determined to be an unknown word. If the phoneme chain model weight is not changed, it is difficult to set a weight appropriate to the utterance environment, so it may not be possible to determine “self-employed” to be an unknown word. The unknown word section detection unit 512 converts the “self-employed” part to “unknown word” and outputs it to the output unit 513, and the output unit 513 outputs the result to a display device such as a display.
[0074]
[Effect of the invention]
As described above, according to the present invention, unknown words can be detected with high accuracy when performing speech recognition. In addition, it is possible to realize unknown word processing that is robust even in recognition of parts other than unknown words.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an example of a speech recognition apparatus according to the present invention.
FIG. 2 is a flowchart illustrating an operation of a recognition unit according to the present invention.
FIG. 3 is a block diagram showing an example of a conventional speech recognition apparatus.
FIG. 4 is a flowchart showing the operation of a recognition unit in a conventional speech recognition apparatus.
FIG. 5 is a block diagram showing an example of a speech recognition apparatus that performs word recognition using the speech recognition method according to the present invention.
FIG. 6 is a flowchart showing an operation of a recognition unit according to the present invention for word recognition.
[Explanation of symbols]
100, 300, 500 Speech recognition apparatus
101, 301, 501 Input unit
102, 302, 502 Acoustic analysis unit
103, 303, 503 Recognition unit
104, 304, 504 Recognition dictionary
105, 305 Language model
106, 306, 506 Phonological chain model
107, 307, 507 Acoustic model
108 Language model weight change unit
109 Language model weight storage unit
110, 510 Phonological chain model weight change unit
111, 308, 511 Phoneme chain model weight storage unit
112, 309, 512 Unknown word section detection unit
113, 310, 513 Output unit

Claims

A speech recognition apparatus having a recognition dictionary that stores words, a language model trained in advance on the words stored in the recognition dictionary, and a phonological chain model that stores rules of phonological chains that can occur in the recognition target language, the speech recognition apparatus comprising:
A recognition unit that performs an analysis using the language model and an analysis using the phonological chain model;
A phoneme chain model weight storage unit that stores a value for weighting the analysis result using the phoneme chain model; and
A phonological chain model weight changing unit that changes the value stored in the phonological chain model weight storage unit using the results of the analysis using the language model and the analysis using the phonological chain model.

The speech recognition apparatus according to claim 1, further comprising: a language model weight storage unit that stores the weight of the language model; and a language model weight changing unit that changes the weight stored in the language model weight storage unit on the basis of the results of the analysis using the language model and the analysis using the phonological chain model.

The speech recognition apparatus according to claim 1, wherein the language model weight change unit changes the language model weight to a value obtained by a function of the difference between the likelihoods obtained in the recognition unit by the analysis using the language model and the analysis using the phonological chain model, or of the value stored in the language model weight storage unit.

The speech recognition apparatus according to claim 1 or 2, wherein the phonological chain model weight changing unit changes the phonological chain model weight to a value obtained by a function of the difference between the likelihoods obtained in the recognition unit by the analysis using the language model and the analysis using the phonological chain model, or of the value stored in the phonological chain model weight storage unit.

The speech recognition apparatus according to claim 1 or 2, wherein the recognition unit performs a multi-pass search when calculating likelihoods, and the language model weight change unit maintains the value stored in the language model weight storage unit in the second and subsequent passes.

The speech recognition apparatus according to claim 1 or 2, wherein the recognition unit performs a multi-pass search when calculating likelihoods, and the phoneme chain model weight changing unit maintains the value stored in the phoneme chain model weight storage unit in the second and subsequent passes.

The speech recognition apparatus according to claim 1 or 2, wherein the language model weight storage unit separately stores the value before the utterance, and the language model weight change unit changes the value stored in the language model weight storage unit back to the value before the utterance when it is determined that the utterance has ended.

The speech recognition apparatus according to claim 1, wherein the phonological chain model weight storage unit separately stores the value before the utterance, and the phonological chain model weight change unit changes the value stored in the phonological chain model weight storage unit back to the value before the utterance when it is determined that the utterance has ended.

The speech recognition apparatus according to claim 1 or 2, wherein the language model weight changing unit maintains the value stored in the language model weight storage unit in a processing unit determined to be a silent part.

3. The phonological chain model weight changing unit maintains a value stored in the phonological chain model weight storage unit in a processing unit determined as a silent part. Voice recognition device.

A speech recognition method using a recognition dictionary that stores words, a language model trained in advance on each word stored in the recognition dictionary, and a phonological chain model that stores rules of phonological chains that can occur in the recognition target language, the speech recognition method comprising:
Recognition means for performing, for each processing unit divided by time, an analysis using the language model and an analysis using the phonological chain model, and outputting the maximum likelihood value of each analysis;
Phoneme chain model weight storage means for storing a value for weighting the analysis using the phoneme chain model; and
Phonological chain model weight changing means for changing the value stored in the phonological chain model weight storage means using, for each processing unit, the maximum likelihood values of the analysis using the language model and the analysis using the phonological chain model.

The speech recognition method according to claim 11, further comprising:
Language model weight storage means for storing the weight of the language model; and
Language model weight changing means for changing the weight stored in the language model weight storage means using, for each processing unit, the maximum likelihood values of the analysis using the language model and the analysis using the phonological chain model.

A speech recognition method using a recognition dictionary that stores words, a language model trained in advance on each word stored in the recognition dictionary, and a phonological chain model that stores rules of phonological chains that can exist in the recognition target language, the speech recognition method comprising:
Recognition means for performing an analysis using the language model and an analysis using the phonological chain model;
Phoneme chain model weight storage means for storing a value for weighting the analysis result using the phoneme chain model; and
Phonological chain model weight changing means for changing the value stored by the phonological chain model weight storage means using the results of the analysis using the language model and the analysis using the phonological chain model.

The speech recognition method according to claim 13, further comprising:
Language model weight storage means for storing the weight of the language model; and
Language model weight changing means for changing the weight stored by the language model weight storage means on the basis of the results of the analysis using the language model and the analysis using the phonological chain model.

A speech recognition program that uses a recognition dictionary that stores word information, a language model trained in advance on each word stored in the recognition dictionary, and a phonological chain model that stores rules of phonological chains that can exist in the recognition target language, the speech recognition program causing a computer to function as:
Recognition means for performing an analysis using the language model and an analysis using the phonological chain model;
Phoneme chain model weight storage means for storing a value for weighting the analysis result using the phoneme chain model; and
Phonological chain model weight changing means for changing the value stored by the phonological chain model weight storage means using the results of the analysis using the language model and the analysis using the phonological chain model.

The speech recognition program according to claim 15, further causing the computer to function as:
Language model weight storage means for storing the weight of the language model; and
Language model weight changing means for changing the weight stored by the language model weight storage means on the basis of the results of the analysis using the language model and the analysis using the phonological chain model.