JP3589044B2 - Speaker adaptation device

Info

Publication number: JP3589044B2
Application number: JP29792498A
Authority: JP
Inventors: 純石井
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1998-10-20
Filing date: 1998-10-20
Publication date: 2004-11-17
Anticipated expiration: 2018-10-20
Also published as: JP2000122689A

Abstract

PROBLEM TO BE SOLVED: To prevent erroneous estimation of standard-pattern parameters and improve the recognition rate in an unsupervised speaker adaptation system, even when the recognition result used for speaker adaptation learning is incorrect. SOLUTION: In this speaker adaptation device, a recognition result reliability computing means 101 calculates a recognition result reliability from the speech recognition result 2006 for speaker adaptation learning output by a collating means 2003, the speech feature parameters output by a speech feature parameter extracting means 2002, and a standard pattern 2004. An unsupervised speaker adaptation means 102 with recognition result reliability then calculates an unsupervised speaker adaptation pattern 2008 from that reliability, the speech recognition result 2006 output by the collating means 2003, the speech feature parameters output by the extracting means 2002, and the standard pattern 2004.

Description

【０００１】
【発明の属する技術分野】
本発明は、多数の話者の音声データによりパラメータ学習を行った標準パタンを、ある話者に適応した話者適応パタンに更新するようにした教師なし話者適応化装置、及びその話者適応パタンを用いた音声認識装置に関する。
【０００２】
【従来の技術】
音声認識のアプリケーションを想定した場合、事前の話者音声の登録を必要としない不特定話者音声認識システムの要望が高く、隠れマルコフモデル（ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌ、以下ＨＭＭとする）、ニューラルネット（ＮｅｕｒａｌＮｅｔｗｏｒｋ、以下ＮＮとする）を用いた音声認識方式による実用化検討が行われている。ＨＭＭ、ＮＮの詳細は、例えば「音声認識の基礎（上、下）」Ｌ．ＲＡＢＩＮＥＲ、Ｂ．Ｈ．ＪＵＡＮＧ、古井監訳、１９９５年、１１月、ＮＴＴアドバンステクノロジ（以下、文献１という）に記されている。これらの方法は、予め多数の話者からの単語、文などの音声データを用いた、標準パタンの学習によって不特定話者標準パタンを作成するものである。
【０００３】
しかしながら、ＨＭＭやＮＮによる不特定話者音声認識システムは、特定話者に限定した場合、その特定話者からの単語、文などの音声データによって標準パタンを学習した特定話者認識システムと比較して、単語誤り率で２〜３倍程度であるのが現状である。そこで不特定話者音声認識システムの向上をはかるため、話者適応技術の研究が最近盛んに行われている。
【０００４】
話者適応化技術は、特定話者の少量の音声データ（以下適応データとする）を用いて、音声認識システムを使用する前や使用中に、不特定話者音声認識システムの標準パタンのパラメータを適応学習して認識率の向上を図るものである。話者適応化方式については、「音声認識における話者適応」松本弘、日本音響学会平成７年春季研究発表会講演論文集、ｐｐ．２７−３０１９９５年３月（以下、文献２という）に詳しい。話者適応化法としては、適応学習データの発話の内容が既知の音声を用いるか、あるいは任意の未知の発話内容の音声を使用するかにより、「教師あり／教師なし」の２つの方法がある。教師あり話者適応方式は、適応データを用いた適応学習後の認識精度は高いが、音声認識装置の使用者が使用前に予め決められた単語や文章を発声しなければならず、使用者の負担が大きい。一方、教師なし話者適応方式は、音声認識装置の使用中に使用者が適応学習を意識することなく認識率の改善を得ようとする方法である。実際の音声認識のアプリケーションでは、教師なし話者適応の確立が望まれている。
【０００５】
従来の教師なし適応化では、入力音声に対して不特定話者用の標準パタンを用いて照合を行い、照合を行った結果として得られる認識結果を発声内容であるとして、不特定話者用標準パタンを連結し、入力音声を適応データとして標準パタンのパラメータを更新する。例えば「ＳｐｅａｋｅｒＡｄａｐｔａｔｉｏｎｏｆＣｏｎｔｉｎｕｏｕｓＤｅｎｓｉｔｙＨＭＭｓＵｓｉｎｇＭｕｌｔｉｖａｒｉａｔｅＬｉｎｅａｒＲｅｇｒｅｓｓｉｏｎ」Ｃ．Ｌ．ＬｅｇｇｅｔｔｅｒａｎｄＰ．Ｃ．Ｗｏｏｄｌａｎｄ，Ｐｒｏｃ．ｏｆＩＣＳＬＰ９４、ｐｐ．４５１−４５４、１９９４年（以下、文献３という）で報告されている。
【０００６】
以下に従来例として文献３に記述されている認識結果を発声内容とする教師なし話者適応化装置を図２１のブロック図を参照して説明する。図２１において、入力音声２００１は、認識装置の使用話者が発声した単語や文章の音声である。ここでの１発声はポーズからポーズの間の文節や文章として説明を行う。
【０００７】
音声特徴量抽出手段２００２は入力音声２００１の音声信号をＡ／Ｄ変換し、Ａ／Ｄ変換された信号を５ミリ秒〜２０ミリ秒程度の一定時間間隔のフレームで切り出し、音響分析を行って音声特徴量を抽出する。ここで音声特徴量とは、少い情報量で音声の特徴を表現できるものであり、例えばケプストラム、ケプストラムの動的特徴の物理量で構成する特徴量ベクトルである。
【０００８】
照合手段２００３では、認識辞書２００５でテキスト表記によって設定している認識対象の単語［Ｗ（１），Ｗ（２），．．．，Ｗ（ｗｎ）］（括弧内は単語番号、ｗｎは認識対象単語数）を認識ユニットのラベル表記へ変換し、ラベルに対応した認識ユニットの標準パタン２００４を連結することで認識対象単語の標準パタンを作成する。そして音声特徴量抽出手段２００２からの出力である発声１から発声Ｎまでの音声特徴量の時系列Ｏ＝［ｏ（１），ｏ（２），．．．，ｏ（Ｔ）］（括弧内は時刻、Ｔは最大フレーム数）に対して照合を行い、話者適応学習用音声認識結果２００６を出力する。話者適応学習用音声認識結果２００６は発声に対して最も照合スコア（尤度とも言う）が高い単語番号系列Ｒｎ■＝［ｒ■（１），ｒ■（２），．．．，ｒ■（ｍ■）］を計算し、単語番号に対応した単語のテキスト表記Ｒｗ■＝［Ｗ（ｒ■（１）），Ｗ（ｒ■（２）），．．．，Ｗ（ｒ■（ｍ■））］を出力する。ここで、ｒ■（ｉ）は話者適応学習用音声認識結果２００６の単語列中のｉ番目の単語の単語番号を示す。また、ｍ■は話者適応用音声認識結果２００６の単語列数を示す。
【０００９】
標準パタン２００４は、予め用意した標準パタンであり、文献３では認識ユニットを前後音素環境（コンテキスト）依存の音素としたＨＭＭを用い、多数の話者の音声データでパラメータ学習を行った標準パタンを初期の標準パタンとして使用している。ＨＭＭは状態単位で以下の情報をパラメータとして有することで複数の認識ユニットの標準パタンを形成する。
（ａ）状態番号
（ｂ）受理可能なコンテキストクラス
（ｃ）先行状態及び後続状態のリスト
（ｄ）出力確率密度分布のパラメータ
（ｅ）自己遷移確率及び後続状態への遷移確率
【００１０】
認識辞書２００５は、予め定めた認識対象とする単語や文章をテキストで格納し、テキスト表記から認識ユニットラベルへの変換を行って、このラベル系列にしたがって標準パタン２００４から対応する認識ユニット標準パタンを連結して照合手段２００３で用いる認識対象単語の標準パタンを生成する。例えば認識辞書２００５に「あお」が存在するならば、これは音素系列で表した場合は／ａｏ／となる。離散発声の「あお」の認識に用いる標準パタンは中心音素が／ａ／であり、先行音素が無音、後続音素が／ｏ／である認識ユニットのＨＭＭ λ−ａｏと、中心音素が／ｏ／であり、先行音素が／ａ／、後続音素が無音の認識ユニットのＨＭＭ λａｏ−を連結したＨＭＭによって照合を行う。最近ではこのような前後音素環境依存の音素ＨＭＭを用いて、認識対象語彙が４０，０００単語以上の音声認識システムの検討が行われている。
【００１１】
教師なし話者適応手段２００７は、照合手段２００３の出力である話者適応学習用音声認識結果２００６と標準パタン２００４を入力し、認識結果の認識ユニットラベル系列に基づき、標準パタン２００４の音素ＨＭＭを連結し、音声特徴量抽出手段２００２からの出力である音声特徴量の時系列を適応データとして標準パタンのパラメータを更新し、教師なし話者適応パタン２００８を出力する。
【００１２】
文献３では、数式１で示される重回帰写像モデルに基づき、ＨＭＭのパラメータの一つであるガウス分布の平均ベクトルを線形変換することで教師なし話者適応パタン２００８を計算する。数式１においてμ（ｑ）、μａ（ｑ）は更新前後のガウス分布番号ｑの平均ベクトルであり、次元数はｄであり音声特徴量ベクトルの次元数と同じである。Ａはｄ×ｄの変換行列であり、ｖはｄ次元の定数項ベクトルである。変換行列Ａとｖは数式２によってＡのｐ行目、ｖのｐ次元目を算出する。数式２において、Ψは更新を行うガウス分布番号の集合、ｒ（ｉ，ｔ）は時刻ｔにガウス分布ｉに特徴ベクトルｏ（ｔ）が存在する期待値、μ（ｉ，ｒ）はガウス分布ｉの平均ベクトルのｒ次元目の要素、σ２（ｉ，ｐ）はガウス分布ｉの共分散行列のｐ行ｐ列目の要素、ｏ（ｔ，ｐ）は特徴ベクトルｏ（ｔ）のｐ次元目の要素、Ｔは適応データの総フレーム数である。
【数１】
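The image for Equation 1 is not reproduced in this text. Going by the definitions in paragraph 【００１２】 (μ(q) and μa(q) are the mean vectors of Gaussian q before and after the update, A is a d×d transformation matrix, and v is a d-dimensional constant vector), the multiple-regression mapping presumably takes the affine form below. This is a reconstruction from those definitions, not the original typesetting.

```latex
\mu_a(q) = A\,\mu(q) + v
```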

【数２】

【００１３】
教師なし話者適応パタン２００８は、教師なし話者適応手段２００７からの出力であり、この標準パタンを用いて音声認識装置などで音声認識が行われる。
【００１４】
【発明が解決しようとする課題】
しかし、従来の教師なし話者適応化装置では、照合を行って得られた話者適応用認識結果を発声内容として標準パタンのパラメータの更新を行っていたため、話者適応学習用認識結果が誤った場合には、パラメータの誤った推定が行われ認識率が低下する、という問題点があった。
【００１５】
そこで、本発明は、以上の問題点を解決し、教師なし話者適応方式において話者適応学習用認識結果が誤った場合においても、標準パタンのパラメータ誤推定を防ぎ、認識率を向上させることのできる話者適応化装置、およびその話者適応化装置により更新された教師なし話者適応パタン使用して音声認識を行う音声認識装置を提供することを目的とする。
【００１６】
【課題を解決するための手段】
この発明に係る話者適応化装置では、話者の入力音声から抽出した音声特徴量と、多数の話者の音声データによりパラメータ学習を行って得た標準パタンと、を照合して認識結果を出力するとともに、前記標準パタンを、前記入力音声を発した話者に適応した話者適応パタンに更新するか否かを、前記認識結果の信頼度に応じて決定する話者適応化装置において、
前記音声特徴量と前記標準パタンから推定適応パタンを最尤推定により算出する標準パタンパラメータ最尤推定手段と、
この標準パタンパラメータ最尤推定手段により算出された推定適応パタンを構成するパラメータの値と前記標準パタンを構成するパラメータの値とを、前記信頼度に応じて線形補間することにより前記話者適応パタンを算出するパラメータ線形補間手段と、
を備えるものである。
【００１７】
また、次の発明に係る話者適応化装置では、話者の入力音声から抽出した音声特徴量と、多数の話者の音声データによりパラメータ学習を行って得た標準パタンと、を照合して認識結果を出力するとともに、前記標準パタンを、前記入力音声を発した話者に適応した話者適応パタンに更新するか否かを、前記認識結果の信頼度に応じて決定する話者適応化装置において、
前記信頼度に基づいて、前記入力音声から得られる話者適応用データのパラメータ学習への重み付けを計算し、重み付けされた話者適応用データを用いて、前記標準パタンからを構成するパラメータを、前記話者適応パタンを構成するパラメータに更新する適応学習手段を備えたことを特徴とする。
【００１８】
また、次の発明に係る話者適応化装置では、話者の入力音声から抽出した音声特徴量と、多数の話者の音声データによりパラメータ学習を行って得た標準パタンと、を照合して認識結果を出力するとともに、前記標準パタンを、前記入力音声を発した話者に適応した話者適応パタンに更新するか否かを、前記認識結果の信頼度に応じて決定する話者適応化装置において、
過去に出力した前記認識結果の信頼度の値に基づいて、異なる話者適用学習アルゴリズムを選択する話者適用方式選択手段を備えたことを特徴とする。
【００３９】
【発明の実施の形態】
実施の形態１．
図１は、請求項１記載の発明による話者適応化装置の１構成である実施の形態１を示すブロック図である。図１において従来技術の説明図である図２１と同一の機能ブロックは同一の符号を付し説明を省略する。従来技術と異る本発明の特徴的な部分は、認識結果信頼度演算手段１０１を備えたことと、教師なし話者適応手段２００７の代りに認識結果信頼度付き教師なし話者適応手段１０２を備えたことである。
【００４０】
次に図１を参照しながら動作について説明する。認識結果信頼度演算手段１０１は、照合手段２００３からの出力である話者適応学習用音声認識結果２００６と音声特徴量抽出手段２００２からの出力である音声特徴量、及び標準パタン２００４を入力し、話者適応学習用認識結果２００６に対する信頼度を演算する。認識結果の信頼度は、例えば「種々の統計量を用いた単語リジェクト方式の検討」花沢、阿部、日本音響学会平成１０年春期研究発表会講演論文集、ｐｐ．１４１−１４２、１９９８年３月（以降、文献４という）に示されている統計量を用いる。
【００４１】
文献４では、認識結果の信頼度を得るために（１）音響尤度差、（２）音素継続時間長、（３）音素混同行列の３種類の統計量を用いている。
（１）の音響尤度差は、入力音声の話者適応学習用音声認識結果２００６であるＲｗ■のフレーム尤度と、全音素接続の音素タイプライタによる音声認識装置の認識結果Ｒｗ■の区間に対しての尤度の差を数式３により計算して信頼度とするものである。数式３においてｌｔはフレームｔにおける認識結果Ｒｗ■の対数フレーム尤度、Ｌｔは、音素タイプライタによる対数フレーム尤度である。また、ＮはＲｗ■の音素数、ｂｉとｅｉは、ｉ番目の音素の始端と終端フレームである。Ｓａは値が小さいほど信頼性が高い統計量であるので通常はマイナスを乗じた値として信頼度とする。
【数３】

【００４２】
（２）の音素継続時間長は、入力音声に対する話者適応学習用音声認識結果Ｒｗ■の各音素の隣接音素間の継続時間長の整合性に基づく信頼性の統計量であり、数式４によって信頼度を計算する。数式４においてｄｉはＲｗ■を構成するｉ番目の音素を中心として前後１音素づつの継続時間長を並べた３次元のベクトルであり、Ｄｉは他の多数話者の音声データを用いて事前に求めた前記３音素の継続時間長の平均値を並べた３次元ベクトルである。数式４によって演算するＳｄは、認識結果Ｒｗ■中の隣接する３音素間の継続時間長の比が、学習データによって求めた平均時間長の比に近いほど大きな値をとる。したがって、Ｓｄは値が大きいほど認識結果の信頼度が高い統計量である。
【数４】

【００４３】
（３）の音素混同行列は、音素タイプライタによる音素認識を並行して行い、話者適応学習用音声認識結果Ｒｗ■を構成する音素系列と音素タイプライタによる認識結果である音素系列とを時間軸上で対応づけ、事前に求めた音素混同行列を用いて数式５によって信頼度を計算する。数式５において、ｈｉはＲ■ｗを構成するｉ番目の音素モデル、ｐｉｋは音素タイプライタによる音素系列中でｈｉと区間が重なる音素、Ｋｉはｈｉと区間が重なる音素数、ｍ（ｈ，ｐ）は事前に求めた音素ｈ音素ｐの混同率、ｗｉｋはｈｉとｐｉｋとの区間重なり率であり、数式６によって計算する。数式５のＳｃは、値が大きいほど認識結果の信頼度が高い統計量である。最終的なＲｗ■の認識結果信頼度は上記の３種類の統計量を用い、数式７によって計算する。数式７においてｗ２、ｗ３は重み係数であり実験的に設定する。
【数５】

【数６】

【数７】

【００４４】
認識結果信頼度付き教師なし話者適応手段１０２は、認識結果信頼度演算手段１０１からの出力である認識結果信頼度と、照合手段２００３からの出力である話者適応学習用音声認識結果２００６と、音声特徴量抽出手段２００２からの出力である音声特徴量と、標準パタン２００４を入力して標準パタンのパラメータの更新を行い、教師なし話者適応パタン２００８を出力する。
従って、この実施の形態１の話者適応化装置によれば、上記のように認識結果に対して信頼度を付加して教師なし話者適応を行うので認識結果が誤っている場合でも、標準パタンのパラメータの誤った更新を防ぐので、認識率を向上させることができる。
【００４５】
実施の形態２．
図２は、請求項２記載の発明による話者適応化装置の１構成例である実施の形態２を示すブロック図である。図２において、実施の形態１と同一の機能ブロックは同一の番号を付し説明を省略する。本発明の特徴的な部分は、先行する発声によって更新した教師なし話者適応パタン２００８を標準パタン２００４へ代入し、引き続く発声に対して教師なし話者適応を行うことを特徴としたことである。
【００４６】
次に図２を参照しながら動作について説明する。認識結果信頼度付き教師なし話者適応手段１０２は、使用者の最初の発声Ｏ（１）＝［ｏ（ｔ１），ｏ（ｔ１＋１），．．．，ｏ（Ｔ１）］を用いて標準パタン２００４のパラメータを更新して教師なし話者適応パタン２００８を出力する。ここで、この最初の発声によって得られた教師なし話者適応パタンをΛ（１）とする。次にΛ（１）を標準パタン２００４とし、使用者の２番目の発声Ｏ（２）＝［ｏ（ｔ２），ｏ（ｔ２＋１），．．．，ｏ（Ｔ２）］を用いて教師なし話者適応処理によって更に標準パタン２００４を更新して、教師なし話者適応パタン２００８を計算する。このようにｊ番目の発声を用いた教師なし話者適応の更新前の標準パタンとして（ｊ−１）番目の発声までに逐次的に更新したΛ（ｊ−１）を用いる。
従って、この実施の形態２の話者適応化装置によれば、上記のように認識結果に対して信頼度を付加して逐次的に教師なし話者適応を行うので認識結果が誤っている場合でも、標準パタンのパラメータの誤った更新を防ぐので、認識率を向上させることができる。
【００４７】
実施の形態３．
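FIG. 3 is a diagram explaining the operation of the recognition result reliability computing means in the speaker adaptation device according to the invention of claim 3, and shows the feature of Embodiment 3. The characteristic feature of Embodiment 3 is that the recognition result reliability output by the recognition result reliability computing means 101 is calculated once per utterance, where utterances are delimited by pauses. As shown in FIG. 3, when the start and end of the k-th utterance are tus(k) and tue(k), the recognition result reliability computing means 101 calculates a single recognition result reliability Su(k) for the frames between tus(k) and tue(k), and assigns Su(k) as the recognition result reliability of every frame between tus(k) and tue(k).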
従って、この実施の形態３の話者適応化装置によれば、上記のように１発声毎に認識結果に対して信頼度を付加して教師なし話者適応を行うので、認識結果が誤っている場合でも、標準パタンのパラメータの誤った更新を防ぐので、認識率を向上させることができる。
【００４８】
実施の形態４．
図４は、請求項４記載の発明による話者適応化装置の認識結果信頼度演算手段の動作説明図であり、実施の形態４の特徴を示す図である。本実施の形態４の特徴的な部分は、認識結果信頼度演算手段１０１からの出力である認識結果信頼度は、認識ユニットに１つ計算することである。認識ユニットとは標準パタンの基本単位であり、認識ユニットを連結することで認識対象の単語、文章を認識する標準パタンを構成する。認識結果信頼度演算手段１０１は入力音声の話者適応学習用音声認識結果２００６に基づき、認識ユニットラベル系列にしたがって標準パタンを連結し、この標準パタンによって音声特徴量の時系列を認識ユニットに分割する。分割されたｕ番目の認識ユニットの始端と終端をｔｒｓ（ｕ）、ｔｒｅ（ｕ）とした場合に、ｔｒｓ（ｕ）とｔｒｅ（ｕ）の間のフレームに関して１つの認識結果信頼度Ｓｒ（ｕ）を図４のように計算し、区間内のフレームの認識結果信頼度をＳｒ（ｕ）とする。図４は認識結果が５個の認識ユニットによって構成されている例である。
従って、この実施の形態４の話者適応化装置によれば、上記のように１認識ユニット毎に認識結果に対して信頼度を付加して教師なし話者適応を行うので認識結果が誤っている場合でも、標準パタンのパラメータの誤った更新を防ぐので、認識率を向上させることができる。
【００４９】
実施の形態５．
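FIG. 5 is a diagram explaining the operation of the recognition result reliability computing means in the speaker adaptation device according to the invention of claim 5, and shows the feature of Embodiment 5. The characteristic feature of Embodiment 5 is that the recognition result reliability output by the recognition result reliability computing means 101 is calculated once per speech unit, such as a phoneme or syllable. The case where the speech unit is the phoneme is described below. The recognition result reliability computing means 101 divides the time series of speech features into phoneme units according to the phoneme sequence of the speech recognition result 2006 for speaker adaptation learning of the input speech. When the start and end of the p-th phoneme obtained by this division are tps(p) and tpe(p), a recognition result reliability Sp(p) is calculated for the frames between tps(p) and tpe(p) as shown in FIG. 5, and Sp(p) is assigned as the recognition result reliability of each frame in the interval from tps(p) to tpe(p). FIG. 5 shows an example in which the recognition result for speaker adaptation learning of the input speech consists of the five phonemes of /onsei/.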
従って、この実施の形態５の話者適応化装置によれば、上記のように１音素毎に認識結果に対して信頼度を付加して教師なし話者適応を行うので認識結果が誤っている場合でも、標準パタンのパラメータの誤った更新を防ぐので、認識率を向上させることができる。
【００５０】
実施の形態６．
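FIG. 6 is a diagram explaining the operation of the recognition result reliability computing means in the speaker adaptation device according to the invention of claim 6, and shows the feature of Embodiment 6. The characteristic feature of Embodiment 6 is that the recognition result reliability output by the recognition result reliability computing means 101 is calculated per frame at a fixed time interval. The operation is described below with reference to FIG. 6. The recognition result reliability computing means 101 outputs a recognition result reliability for each frame of the input speech, the frames being taken at a fixed interval of about 5 to 20 milliseconds. FIG. 6 shows the recognition result reliabilities [Sf(t), Sf(t+1), ..., Sf(t+5)] being output for frames t to t+5.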
従って、この実施の形態６の話者適応化装置によれば、このように一定時間間隔のフレーム単位で認識結果信頼度を計算するので、認識結果が誤っている場合でも、標準パタンのパラメータの誤った更新を防ぐことができ、認識率を向上させることができる。
【００５１】
実施の形態７．
図７は、請求項７記載の発明による話者適応化装置の１構成例である実施の形態７を示すブロック図である。図７において、実施の形態１と同一の機能ブロックは同一の番号を付し説明を省略する。本発明の特徴的な部分は、認識結果信頼度付き教師なし話者適応手段１０２は、音声データセグメンテーション手段７０１と認識結果信頼度付き標準パタンパラメータ更新手段７０２で構成することを特徴としたことである。
【００５２】
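Next, the operation is described with reference to FIG. 7. The speech data segmentation means 701 concatenates the corresponding recognition-unit standard patterns from the standard pattern 2004 according to the speech recognition result 2006 for speaker adaptation learning, and segments the time series of speech features by recognition unit. When the standard pattern is an HMM, for example, the segmentation is performed with the Viterbi algorithm described in Reference 1. The Viterbi algorithm finds the single optimal state sequence [q1, q2, ..., qt] for a time series of speech features [o(1), o(2), ..., o(t)]. Suppose, for example, that a word standard pattern consists of three recognition units, each a one-state HMM, so that the states are (s1, s2, s3). If the optimal state sequence obtained by the Viterbi algorithm is [s1, s1, s2, s2, s2, s3, s3, s3], then frames 1 to 2 are assigned to unit 1, frames 3 to 5 to unit 2, and frames 6 to 8 to unit 3.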
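The step from an optimal state sequence to per-unit frame ranges can be sketched as follows. This is a minimal illustration that assumes one state per recognition unit, as in the worked example in the text; the function name `segment_by_state` is hypothetical, and the Viterbi search itself is not shown.

```python
from itertools import groupby


def segment_by_state(state_sequence):
    """Group a per-frame state sequence into (state, first_frame, last_frame) runs.

    Frames are numbered from 1, matching the example in the text.
    """
    segments = []
    frame = 1
    for state, run in groupby(state_sequence):
        length = len(list(run))
        segments.append((state, frame, frame + length - 1))
        frame += length
    return segments


# Optimal state sequence from the example in the text.
path = ["s1", "s1", "s2", "s2", "s2", "s3", "s3", "s3"]
segments = segment_by_state(path)
```

With the sequence from the text, the three runs cover frames 1 to 2, 3 to 5, and 6 to 8.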
【００５３】
認識結果信頼度付き標準パタンパラメータ更新手段７０２は、認識ユニットの標準パタンパラメータを、セグメンテーションによって分割された音声特徴量と認識結果信頼度を用いて更新する。
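Therefore, according to the speaker adaptation device of Embodiment 7, speech data segmentation is performed and the standard pattern parameters are learned with recognition result reliability as described above, so that erroneous updating of the standard pattern parameters can be prevented and the recognition rate improved even when the recognition result for speaker adaptation learning is incorrect.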
【００５４】
実施の形態８．
図８は、請求項８記載の発明による話者適応化装置の１構成例である実施の形態８を示すブロック図である。図８において、実施の形態１と同一の機能ブロックは同一の番号を付し説明を省略する。本発明において特徴的な部分は、認識結果信頼度付き教師なし話者適応手段１０２を、標準パタンパラメータ最尤推定手段８０１と、認識結果信頼度に基づくパラメータ線形補間手段８０２とで構成することである。
【００５５】
次に図８を参照しながら動作について説明する。標準パタンパラメータ最尤推定手段８０１は、音声特徴量抽出手段２００２の出力である音声特徴量と、話者適応学習用音声認識結果２００６に基づいて標準パタン２００４の認識ユニット標準パタンを連結した標準パタンを用いて、標準パタンのパラメータの最尤推定を行い、推定後の標準パタンΛｍを得る。最尤推定は、例えば文献１に記載されているＢａｕｍ−Ｗｅｌｃｈ法によってパラメータ推定を行う。
【００５６】
認識結果信頼度に基づくパラメータ線形補間手段８０２は、標準パタンパラメータ最尤推定手段８０１からの出力である最尤推定後の標準パタンΛｍ、及び推定前の標準パタンΛを入力し、認識結果信頼度演算手段１０１からの出力である認識結果信頼度によってΛｍとΛのパラメータの線形補間を行い、得られた値を教師なし話者適応パタン２００８のパラメータとする。例えば標準パタンがＨＭＭであり、ガウス分布の平均ベクトルμ（ｑ）（ｑはガウス分布の番号）を更新する場合には、数式８によって教師なし話者適応パタン２００８の平均ベクトルμａ（ｑ）を計算する。数式８においてμ（ｑ）、μｍ（ｑ）は最尤推定前後の平均ベクトルの値である。またｗｍ（ｑ）は、値が０から１．０の重み係数であり、μ（ｑ）の更新に用いた適応データの認識結果信頼度によって決定する。
【数８】
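The image for Equation 8 is not reproduced in this text. From the definitions in paragraph 【００５６】 (μ(q) and μm(q) are the mean vectors before and after maximum likelihood estimation, and wm(q) is a weight between 0 and 1.0), and taking wm(q) to be the weight on the maximum likelihood estimate as Embodiment 9 suggests, the linear interpolation presumably reads as below. This is a reconstruction, not the original typesetting.

```latex
\mu_a(q) = w_m(q)\,\mu_m(q) + \bigl(1 - w_m(q)\bigr)\,\mu(q)
```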

従って、この実施の形態８の話者適応化装置によれば、上記のように標準パタンパラメータ最尤推定後に認識結果信頼度に基づいてパラメータの線形補間を行うので、話者適応学習用認識結果が誤っている場合でも、標準パタンのパラメータの誤った更新を防ぐことができ、認識率を向上させることができる。
【００５７】
実施の形態９．
実施の形態９は、実施の形態８の話者適応化装置における標準パタンのパラメータの線形補間において、パラメータの最尤推定に使用した適応データの認識結果信頼度の合計値が大きければ最尤推定値の重みを大きくすることを特徴とした請求項９記載の発明による話者適応化装置である。数式９は数式８の重み係数ｗｍ（ｑ）の値を計算する請求項９記載の発明の１例である。数式９においてＳｆ（ｔ）はフレームｔにおける認識結果信頼度、Ωはパラメータμの更新に用いる適応データのフレームの時刻の集合、τは値が０以上の制御定数である。
【数９】

従って、この実施の形態９の話者適応化装置によれば、上記のように構成することで認識結果が誤っている場合でも、標準パタンのパラメータの誤った学習を防ぐことができ、認識率を向上させることができる。
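The interpolation of Embodiments 8 and 9 can be sketched as follows. The exact form of Equation 9 is not available in this text, so the weight S / (S + τ), where S is the summed frame reliability, is an assumption chosen only because it lies between 0 and 1 and grows with S as the text requires. The function names are hypothetical.

```python
def interpolation_weight(frame_reliabilities, tau):
    """Assumed weight form: total / (total + tau), in [0, 1] for non-negative inputs."""
    total = sum(frame_reliabilities)
    denom = total + tau
    return total / denom if denom > 0 else 0.0


def interpolate_mean(mu, mu_ml, weight):
    """Linear interpolation between the prior mean and the ML estimate."""
    return [weight * m_ml + (1.0 - weight) * m for m, m_ml in zip(mu, mu_ml)]


# Three frames with summed reliability 2.0 and control constant tau = 2.0.
w = interpolation_weight([0.5, 0.5, 1.0], tau=2.0)
adapted = interpolate_mean([0.0, 2.0], [1.0, 4.0], w)
```

A larger τ keeps the adapted mean closer to the speaker-independent value, so low-reliability data moves the model less.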
【００５８】
実施の形態１０．
図９は、請求項１０記載の発明による話者適応化装置の１構成例である実施の形態１０を示すブロック図である。図９において、実施の形態１と同一の機能ブロックは同一の番号を付し説明を省略する。本発明において特徴的な部分は、認識結果信頼度付き教師なし話者適応手段１０２は、認識結果信頼度重み付き学習データによる適応学習手段９０１で構成することである。
【００５９】
次に図９を参照しながら動作について説明する。認識結果信頼度重み付き学習データによる適応学習手段９０１は、話者適応学習用音声認識結果２００６と標準パタン２００４と認識結果信頼度演算手段１０１の出力である認識結果信頼度と音声特徴量抽出手段２００２の出力である音声特長量の時系列とを入力し、認識結果信頼度によって適応データへ重み付けしたパラメータ更新を行う。例えば、標準パタン２００４がＨＭＭである話者適応化装置では、数式１０によってガウス分布の平均ベクトル、数式１１によってガウス分布の共分散行列の更新を行う。数式１０のｏｈ（ｔ）は認識結果信頼度によって重み付けされた音声特徴量であり、例えば数式１２よって計算する。
数式１２において、μ（ｑ）は更新前のガウス分布の平均ベクトル、ｏ（ｔ）は時刻ｔの音声特徴量であり、τは値が０以上の制御定数、Ｓｆ（ｔ）はフレームｔの認識結果信頼度であるので、Ｓｆ（ｔ）が小さい場合はｏｈ（ｔ）は更新前の平均ベクトルに近い値となり、ｏ（ｔ）のパラメータ更新への寄与度が小さく、またＳｆ（ｔ）が大きい場合は、ｏｈ（ｔ）はｏ（ｔ）に近い値となりパラメータ更新への寄与度が大きくなる。数式１０においてγ（ｑ，ｔ）は、時刻ｔにガウス分布ｑに音声特徴量ｏ（ｔ）が存在する期待値であるが、重み付けされた音声特徴量ｏｈ（ｔ）が存在する期待値として計算してもよい。また、ここで得られたμａ（ｑ）を数式８のμｍ（ｑ）として更新前の標準パタンパラメータとの線形補間を行うことも可能である。
【数１０】

【数１１】

【数１２】

従って、この実施の形態１０の話者適応化装置によれば、このように構成することで認識結果が誤っている場合でも、標準パタンのパラメータの誤った学習を防ぐことができ、認識率を向上させることができる。
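The reliability weighting of adaptation data in Embodiment 10 can be sketched as follows. Equations 10 and 12 are not available in this text, so both forms here are assumptions: the weighted feature is taken as (Sf·o + τ·μ) / (Sf + τ), which approaches μ for small Sf and o for large Sf as the text describes, and the mean update is taken as the usual occupancy-weighted average. The function names are hypothetical.

```python
def weighted_feature(o_t, mu, s_t, tau):
    """Assumed form of the reliability-weighted feature oh(t)."""
    denom = s_t + tau
    if denom == 0:
        return list(mu)
    return [(s_t * o + tau * m) / denom for o, m in zip(o_t, mu)]


def update_mean(features, mu, reliabilities, gammas, tau):
    """Occupancy-weighted average of the reliability-weighted features."""
    total_gamma = sum(gammas)
    if total_gamma == 0:
        return list(mu)
    acc = [0.0] * len(mu)
    for o_t, s_t, g in zip(features, reliabilities, gammas):
        oh_t = weighted_feature(o_t, mu, s_t, tau)
        for d, value in enumerate(oh_t):
            acc[d] += g * value
    return [a / total_gamma for a in acc]


# One reliable frame (Sf = 1) and one unreliable frame (Sf = 0).
new_mean = update_mean([[2.0], [4.0]], [0.0], [1.0, 0.0], [1.0, 1.0], tau=1.0)
```

The unreliable frame contributes only the prior mean, so it cannot pull the estimate toward a misrecognized segment.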
【００６０】
実施の形態１１．
また、実施の形態１１は、実施の形態１０の話者適応化装置において、認識結果信頼度をフレーム毎に付与し、その値が０〜１であり、信頼度が高い場合には１に近い値を出力することを特長とした請求項１１記載の話者適応化装置である。
従って、この実施の形態１１の話者適応化装置によれば、上記のように構成することで認識結果が誤っている場合でも、標準パタンのパラメータの誤った学習を防ぐことができ、認識率を向上させることができる。
【００６１】
実施の形態１２．
図１０は、請求項１２記載の発明による話者適応化装置の１構成例である実施の形態１２を示すブロック図である。図１０において、実施の形態１と同一の機能ブロックは同一の番号を付し説明を省略する。本発明において特徴的な部分は、照合手段２００３は複数認識結果候補出力照合手段１００１で構成し、認識結果信頼度演算手段１０１は複数認識結果候補信頼度演算手段１００２で構成し、認識結果信頼度付き教師なし話者適応手段１０２は複数認識結果候補信頼度付き教師なし話者適応手段１００３で構成することを特徴としたことである。
【００６２】
次に図１０を参照しながら動作について説明する。複数認識結果候補出力照合手段１００１は、認識辞書２００５によって定められた認識対象単語にしたがい標準パタン２００４連結して、音声特徴量抽出手段２００２の出力である音声特徴量に対して照合を行ない、予め定めた候補数の認識結果［Ｒｗ■（１），Ｒｗ■（２），．．．，Ｒｗ■（Ｎ）］（Ｒｗ■（ｎ）は、入力音声に対してｎ番目にスコアが高い話者適応学習用音声認識結果、Ｎは予め定めた候補数）を照合スコアが高い認識結果候補から順に出力する。
【００６３】
複数認識結果候補信頼度演算手段１００２は、複数認識結果候補出力照合手段１００１の出力である複数認識結果候補［Ｒｗ■（１），Ｒｗ■（２），．．．，Ｒｗ■（Ｎ）］と音声特徴量と標準パタン２００４とを入力して複数の認識結果候補の各々に対して認識結果信頼度［Ｓｍ（１），Ｓｍ（２），．．．，Ｓｍ（Ｎ）］を計算する。ここで、Ｓｍ（ｎ）は入力音声に対するｎ番目の認識結果候補に対する認識結果信頼度の時系列である。認識結果信頼度がフレーム毎のＳｆ（ｎ，ｔ）であるならば、Ｓｍ（ｎ）＝［Ｓｆ（ｎ，１），Ｓｆ（ｎ，２），．．．，Ｓｆ（ｎ，Ｔｎ）］である。複数認識結果候補信頼度付き教師なし話者適応手段１００３は、複数認識結果候補出力照合手段１００１の出力である複数認識結果候補と複数認識結果候補信頼度演算手段１００２からの出力である認識結果信頼度と標準パタン２００４を入力して標準パタンのパラメータ更新を行い、教師なし話者適応パタン２００８を出力する。
【００６４】
複数認識結果候補信頼度付き教師なし話者適応手段１００３は、例えば複数認識結果各々を用いて独立にＮ個の教師なし話者適応パタンを作成して、Ｎ個の標準パタンのパラメータを合成することで最終的な教師なし話者適応パタン２００８を得る方法がある。例えば標準パタンがＨＭＭであり更新するパラメータをガウス分布の平均ベクトル、共分散行列とした場合、数式１３によってガウス分布ｑの平均ベクトル、数式１４によって共分散行列を計算する。数式１３においてμｉ（ｎ，ｑ）は、第ｎ番目の認識結果候補を用いて更新したガウス分布ｑの平均ベクトルであり、数式１４においてＣｉ（ｎ，ｑ）はｎ番目の認識結果候補を用いて更新したガウス分布ｑの共分散行列である。数式１３、数式１４においてβ（ｎ）は第ｎ番目の認識結果候補に対する重み付けであり数式１５によって計算する。数式１５においてＳｉ（ｎ）は第ｎ番目の認識結果候補の認識結果信頼度であり、例えばフレーム毎の認識結果信頼度の合計である。
【数１３】

【数１４】

【数１５】

従って、この実施の形態１２の話者適応化装置によれば、このように複数認識結果候補を出力し複数認識結果候補に対して認識結果信頼度を計算して、認識結果信頼度付き教師なし話者適応を行うので認識結果が誤っている場合でも、標準パタンのパラメータの誤った学習を防ぐことができ、認識率を向上させることができる。
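Combining the N candidate-specific adapted patterns of Embodiment 12 can be sketched as follows. Equations 13 to 15 are not available in this text; the sketch assumes that β(n) is the candidate reliability normalized over all candidates and that the combined mean is the β-weighted sum of the candidate means. The function names are hypothetical.

```python
def candidate_weights(reliabilities):
    """Assumed form of beta(n): reliability normalized over the N candidates."""
    total = sum(reliabilities)
    if total <= 0:
        return [1.0 / len(reliabilities)] * len(reliabilities)
    return [s / total for s in reliabilities]


def combine_means(candidate_means, weights):
    """Weighted sum of the mean vectors adapted from each candidate."""
    dim = len(candidate_means[0])
    return [
        sum(w * mean[d] for w, mean in zip(weights, candidate_means))
        for d in range(dim)
    ]


# Two candidates; the first is three times as reliable as the second.
beta = candidate_weights([3.0, 1.0])
combined = combine_means([[1.0, 0.0], [5.0, 4.0]], beta)
```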
【００６５】
実施の形態１３．
図１１は、請求項１３記載の発明による話者適応化装置の１構成例である実施の形態１３を示すブロック図である。図１１において、実施の形態１と同一の機能ブロックは同一の番号を付し説明を省略する。本発明において特徴的な部分は、認識結果信頼度付き教師なし話者適応手段１０２の前段に、認識結果信頼度比較手段１１０１が付加されていることである。
【００６６】
次に図１１を参照しながら動作について説明する。認識結果信頼度比較手段１１０１は、認識結果信頼度演算手段１０１からの出力である認識結果信頼度を入力し、認識結果信頼度が予め定めた閾値より大きければ、認識結果信頼度付き教師なし話者適応手段１０２で処理を行う。一方、認識結果信頼度が予め定めた閾値より小さければ、標準パタンのパラメータの更新は行わず、標準パタン２００４の値を教師なし話者適応パタン２００８とする。
【００６７】
例えば、１発声の認識結果信頼度の合計が閾値Ｔｈ以下であるならば、この発声を用いた標準パタンのパラメータ更新は行わない話者適応化装置である。また、標準パタンのパラメータ毎にセグメンテーションによって分割された適応データの認識結果信頼度の合計を計算し、パラメータ毎の認識結果信頼度と閾値を比較し、閾値以下であるならばパラメータの更新を行わず、閾値より大きいパラメータは更新を行う話者適応化装置である。
従って、この実施の形態１３の話者適応化装置によれば、このように認識結果信頼度が予め定めた閾値以下であるならばパラメータの更新を行わないように構成することで認識結果が誤っている場合でも、標準パタンのパラメータの誤った学習を防ぐことができ、認識率を向上させることができる。
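The per-parameter variant of the threshold test in Embodiment 13 can be sketched as follows. The dictionary layout and the name `gated_update` are hypothetical; the comparison follows the text, where a parameter is updated only when its summed reliability is greater than the threshold.

```python
def gated_update(standard, adapted, reliability_sum, threshold):
    """Keep the standard value unless the summed reliability exceeds the threshold."""
    result = {}
    for name, value in standard.items():
        if reliability_sum.get(name, 0.0) > threshold:
            result[name] = adapted[name]
        else:
            result[name] = value
    return result


standard = {"g1": [0.0, 0.0], "g2": [1.0, 1.0]}
adapted = {"g1": [0.5, 0.2], "g2": [3.0, 3.0]}
reliability_sum = {"g1": 12.0, "g2": 2.0}
updated = gated_update(standard, adapted, reliability_sum, threshold=5.0)
```

Here only `g1` is updated, because the adaptation data assigned to `g2` falls below the threshold.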
【００６８】
実施の形態１４．
図１２は、請求項１４記載の発明による話者適応化装置の１構成例である実施の形態１４を示すブロック図である。図１２において、実施の形態１と同一の機能ブロックは同一の番号を付し説明を省略する。本発明において特徴的な部分は、認識結果信頼度付き教師なし話者適応手段１０２の前段に、認識結果信頼度による話者適応方式選択手段１２０１と、Ｍ個の認識結果信頼度付き教師なし話者適応手段１２０２−１〜１２０２−Ｍを備えたことである。
【００６９】
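Next, the operation is described with reference to FIG. 12. The speaker adaptation method selection means 1201 based on recognition result reliability receives the recognition result reliability output by the recognition result reliability computing means 101 and selects an unsupervised speaker adaptation method using predetermined method selection thresholds [Th(1), Th(2), ..., Th(K)]. For example, when the recognition result reliability is Su, the unsupervised speaker adaptation means 1202-k with recognition result reliability is selected if Th(k) ≤ Su < Th(k+1). Here Su is the sum of the recognition result reliabilities over one utterance.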
【００７０】
認識結果信頼度付き教師なし話者適応手段１２０２−１〜１２０２−Ｍは、例えば「ＡＳｔｕｄｙｏｎＳｐｅａｋｅｒＡｄａｐｔａｔｉｏｎｏｆｔｈｅＰａｒａｍｅｔｅｒｓｏｆＣｏｎｔｉｎｕｏｕｓＤｅｎｓｉｔｙＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌｓ」Ｃ．Ｈ．Ｌｅｅ，Ｃ．Ｈ．Ｌｉｎ，Ｂ．Ｈ．Ｊｕａｎｇ，ＩＥＥＥＴＲＡＮＳＡＣＴＩＯＮＯＮＳＩＧＮＡＬＰＥＯＣＥＳＳＩＮＧ，Ｖｏｌ．３９，Ｎｏ．４，１９９１年（以下、文献５という）で提案されている最大事後確率推定法が１２０２−１、「連続混合分布ＨＭＭを用いた移動ベクトル場平滑化話者適応化方式」大倉、杉山、嵯峨山、電子情報通信学会技術報告、ＳＰ９２− １６、１９９２年（以下、文献６という）で提案されている移動ベクトル場平滑化話者適応方式が１２０２−２、重回帰写像モデルに基づく話者適応方式（文献３）が１２０２−３であるとして構成できる。
従って、この実施の形態１４の話者適応化装置によれば、このように構成することで話者適応学習用音声認識結果が誤っている場合でも、標準パタンのパラメータの誤った学習を防ぐことができ、認識率を向上させることができる。
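The threshold-based selection of Embodiment 14 can be sketched as follows. The rule Th(k) ≤ Su < Th(k+1) selecting method k comes from the text; the behavior below Th(1) and above the last threshold is not specified there, so returning `None` and falling back to the last method are assumptions. The method labels are placeholders for means 1202-1 to 1202-3.

```python
from bisect import bisect_right


def select_method(su, thresholds, methods):
    """Return the method k for which Th(k) <= su < Th(k+1); thresholds must be ascending."""
    k = bisect_right(thresholds, su)  # number of thresholds that are <= su
    if k == 0:
        return None  # assumption: below Th(1) no method is selected
    return methods[min(k, len(methods)) - 1]


thresholds = [0.0, 10.0, 20.0]
methods = ["MAP estimation", "vector field smoothing", "multiple regression mapping"]
chosen = select_method(12.5, thresholds, methods)
```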
【００７１】
実施の形態１５．
図１３は、請求項１５記載の発明による話者適応化装置の１構成例である実施の形態１５を示すブロック図である。図１３において、実施の形態１と同一の機能ブロックは同一の番号を付し説明を省略する。本発明において特徴的な部分は、認識結果信頼度付き教師なし話者適応手段を、標準パタンパラメータクラスタリング手段１３０１と、認識結果信頼度付きパラメータグループ教師なし話者適応手段１３０２とで構成することである。
【００７２】
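Next, the operation is described with reference to FIG. 13. The standard pattern parameter clustering means 1301 groups the standard pattern parameters stored in the standard pattern 2004 by clustering. When the standard pattern is an HMM, the distance dv(g(i), g(j)) between Gaussian distributions g(i) and g(j) among the Gaussians [g(1), g(2), ..., g(Mg)] (Mg is the total number of Gaussians) is defined by, for example, the Bhattacharyya distance of Equation 16. Clustering is performed with this distance, and the groups G(x) = [g(x(1)), g(x(2)), ..., g(x(n))] (x(.) is a distribution index) are determined. The clustering is performed using, for example, the K-means method described in Reference 1.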
認識結果信頼度付きパラメータグループ教師なし話者適応手段１３０２は、標準パタンパラメータグループ化手段１３０１からの出力である標準パタンパラメータグループと認識結果信頼度演算手段１０１からの出力である認識結果信頼度を入力し、グループ毎に標準パタンパラメータの変動量の計算を行う。例えば標準パタンがＨＭＭである場合の平均ベクトルのｐ次元目の移動量は数式１７によって計算する。数式１７においてα（ｘ）は数式１８に示す信頼度によって決定される重み係数である。また、Ψｘはパラメータグループｘのガウス分布番号の集合、Ωｉはガウス分布番号ｉの適応データの時刻の集合、σ２（ｉ，ｐ）はガウス分布番号ｉの共分散行列のｐ行ｐ列目である。数式１８において、Ｓｆ（ｔ）はフレームｔの認識結果信頼度であり、τは値が０以上の制御定数である。
また、数式１９によってグループｘの平均ベクトルの共通な移動量ｖ（ｘ，ｐ）を求めることも可能である。数式１９においてｏｈ（ｔ）は数式１２に示した認識結果信頼度によって重み付けされた適応データであり、γ（ｉ，ｔ）は、時刻ｔにガウス分布ｉに音声特徴量ｏ（ｔ）が存在する期待値であるが、重み付けされた音声特徴量ｏｈ（ｔ）が存在する期待値として計算してもよい。
【数１６】

【数１７】

【数１８】

【数１９】

従って、この実施の形態１５の話者適応化装置によれば、このように構成することで認識結果が誤っている場合でも、標準パタンのパラメータの誤った学習を防ぐことができ、認識率を向上させることができる。
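The distance used for clustering Gaussians in Embodiment 15 can be sketched as follows. Equation 16 is not available in this text; the code uses the standard Bhattacharyya distance between two Gaussians and assumes diagonal covariance matrices for brevity, which the patent does not state. The function name is hypothetical.

```python
import math


def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two Gaussians with diagonal covariances."""
    dist = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)  # averaged variance in this dimension
        dist += 0.125 * (m1 - m2) ** 2 / v  # mean-separation term
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))  # covariance-mismatch term
    return dist


same = bhattacharyya_diag([0.0], [1.0], [0.0], [1.0])
shifted = bhattacharyya_diag([0.0], [1.0], [2.0], [1.0])
```

The distance is zero for identical Gaussians and symmetric in its arguments, which makes it usable as the dissimilarity for K-means style grouping.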
【００７３】
実施の形態１６．
図１４は、請求項１６記載の発明による話者適応化装置の１構成例である実施の形態１６を示すブロック図である。図１４において、実施の形態１と同一の機能ブロックは同一の番号を付し説明を省略する。本発明において特徴的な部分は、認識結果信頼度付き教師なし話者適応手段１０２を、標準パタンパラメータ木構造クラスタリング手段１４０１と、木構造化パラメータに基づく標準パタンパラメータグループ化手段１４０２と、認識結果信頼度付きパラメータグループ教師なし話者適応手段１３０２とで構成することである。
【００７４】
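Next, the operation is described with reference to FIG. 14. The standard pattern parameter tree-structure clustering means 1401 clusters the standard pattern parameters into a tree structure using, for example, the Bhattacharyya distance of Equation 16. To build the tree, all parameters are first clustered, as the first level of the tree, into N parameter groups [G(1, 1, 1), G(1, 1, 2), ..., G(1, 1, N)] (in G(i, j, k), i is the level, j is the parent group number, and k is the group number).
Next, as the second level, G(1, m1, n1) is clustered into the groups [G(2, n1, 1), G(2, n1, 2), ..., G(2, n1, Nn1)].
Further, as the third level, G(2, m2, n2) is clustered into the groups [G(3, n2, 1), G(3, n2, 2), ..., G(3, n2, Nn2)]. Clustering continues in this way down to a predetermined level. The standard pattern parameter grouping means 1402 based on tree-structured parameters then groups the parameters, on the basis of the tree-structured parameters output by the standard pattern parameter tree-structure clustering means 1401, according to the recognition result reliability output by the recognition result reliability computing means 101.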
【００７５】
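FIG. 15 is a diagram explaining the grouping of tree-structured parameters by recognition result reliability. For each node, the sum of the recognition result reliabilities of the adaptation data for the parameters belonging to that node and below is calculated as the node's information. When the recognition result reliability of a child node is smaller than a predetermined threshold th and that of its parent node is th or greater, the parameter group under the parent node is used to estimate the parameters belonging to the child node and below. In FIG. 15, the numbers in parentheses are the recognition result reliabilities of the adaptation data for the parameters under each node. For example, if th is 40, Node(3,1) has a reliability of 20 and its parent Node(2,1) has 100, so the adaptation data, recognition result reliabilities, and standard pattern parameters of the parameters under Node(2,1) are used to obtain a variation common to those parameters, and the parameters under Node(3,1) are updated with it. The parameter-group unsupervised speaker adaptation means 1302 with recognition result reliability, which computes the parameter variation of a parameter group, obtains the variation for each parameter group and performs the update as described in Embodiment 15.
Therefore, according to the speaker adaptation device of Embodiment 16, this configuration prevents erroneous learning of the standard pattern parameters and improves the recognition rate even when the recognition result is incorrect.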
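The back-off from a child node to its parent in Embodiment 16 can be sketched as follows. The text describes a single child and parent comparison; climbing further toward the root when the parent is also below the threshold is an assumption, as are the function name and the reliability value given here for Node(1,1).

```python
def find_adaptation_node(node, parent, reliability, threshold):
    """Climb from `node` toward the root until the summed reliability reaches the threshold.

    Returns None if no ancestor (including the node itself) qualifies.
    """
    current = node
    while current is not None and reliability[current] < threshold:
        current = parent.get(current)
    return current


# Values for Node(3,1) and Node(2,1) follow the example in the text.
parent = {"Node(3,1)": "Node(2,1)", "Node(2,1)": "Node(1,1)"}
reliability = {"Node(3,1)": 20, "Node(2,1)": 100, "Node(1,1)": 300}
group_root = find_adaptation_node("Node(3,1)", parent, reliability, threshold=40)
```

The adaptation data under the returned node would then be pooled to estimate one shared variation for the parameters under the starting node.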
【００７６】
実施の形態１７．
また、実施の形態１７の話者適応化装置は、標準パタンとして、連続混合分布型隠れマルコフモデルを用いることを特徴とした請求項１７記載の発明による話者適応化装置である。連続混合分布型隠れマルコフモデルについては文献１に詳細が記載されているので説明は省略する。
【００７７】
実施の形態１８．
また、実施の形態１８の話者適応化装置は、連続混合分布型隠れマルコフモデルのシンボル出力確率密度関数を構成する要素分布関数はガウス分布であることを特徴とする請求項１８記載の発明による話者適応化装置である。ガウス分布関数は数式２０で与えられる。数式２０において、μ（ｉ）、Ｃ（ｉ）はガウス分布ｉの平均ベクトルと共分散行列である。また、ｄは平均ベクトルの次元数であり、ｏは特徴量ベクトルである。
【数２０】
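The image for Equation 20 is not reproduced in this text. With the symbols defined in paragraph 【００７７】 (μ(i) and C(i) are the mean vector and covariance matrix of Gaussian i, d is the dimension, and o is the feature vector), the standard multivariate Gaussian density is as follows. The left-hand notation is chosen here for illustration.

```latex
\mathcal{N}\bigl(o;\,\mu(i),\,C(i)\bigr)
  = \frac{1}{(2\pi)^{d/2}\,\lvert C(i)\rvert^{1/2}}
    \exp\!\left(-\frac{1}{2}\,\bigl(o-\mu(i)\bigr)^{\top} C(i)^{-1}\,\bigl(o-\mu(i)\bigr)\right)
```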

【００７８】
実施の形態１９．
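Further, the speaker adaptation device of Embodiment 19 is the speaker adaptation device according to the invention of claim 19, characterized in that the parameter to be adapted is the mean vector of the Gaussian distribution.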
【００７９】
実施の形態２０．
図１６は、請求項２０記載の発明による話者適応化装置の１構成例である実施の形態２０を示すブロック図である。図１６において、実施の形態１と実施の形態７と同一の機能ブロックは同一の番号を付し説明を省略する。本発明において特徴的な部分は、認識結果信頼度付き教師なし話者適応手段１０２を、認識結果信頼度付き標準パタンパラメータ更新手段７０２と標準パタンパラメータ補間手段１６０１とで構成することである。
【００８０】
次に図１６を参照して動作について説明する。認識結果信頼度付きパラメータ更新手段７０２は実施の形態８や実施の形態１０に記述したパラメータ更新によってガウス分布の平均値の更新を行う。パラメータ補間手段１６０１は、適応学習データが存在しなかったガウス分布の平均ベクトルを認識結果信頼度付きパラメータ更新手段７０２によって学習されたガウス分布の平均ベクトルの更新前後の差ベクトルを用いて数式２１によって補間する。
【数２１】

【００８１】
図１７はガウス分布平均値の補間の概念図である。図１７においてμ（１）、μ（２）、μ（３）は適応データが存在するガウス分布の平均ベクトルであり、μａ（１）、μａ（２）、μａ（３）は教師なし話者適応によって更新した後の平均ベクトルである。また、μ（４）は適応データが存在しない平均ベクトルである。この適応データが存在しないμ（４）は、数式２１によって、近傍の平均ベクトルの更新前後の差ベクトルによって補間を行う。数式２１において、μ（ｑ）、μａ（ｑ）はｑ番目の更新前後の平均ベクトル、αｐ，ｑは重み係数、ＴＶ（ｐ）は更新前後の平均ベクトルの差ベクトル（移動ベクトル）、Ｐは補間に用いる近傍の平均ベクトルの集合である。またｆは制御定数であり、ｄｐ，ｑはマハラノビス距離であり、Ｃ（ｑ）はガウス分布ｑの共分散行列であり、上付き−１は逆行列を表す。
従って、この実施の形態２０の話者適応化装置によれば、このように適応データが存在しないガウス分布の平均ベクトルは、適応データが存在するガウス分布の平均ベクトルの差ベクトルによって補間を行って適応するので認識結果が誤っている場合でも、標準パタンのパラメータの誤った学習を防ぐことができ、認識率を向上させることができる。
【００８２】
実施の形態２１．
図１８は、請求項２１記載の発明による話者適応化装置の１構成例である実施の形態２１を示すブロック図である。図１８において、実施の形態１と同一の機能ブロックは同一の番号を付し説明を省略する。本発明において特徴的な部分は、認識結果信頼度付き教師なし話者適応手段１０２は、認識結果信頼度付き重回帰写像モデルに基づく話者適応手段１８０１であることである。
【００８３】
次に図１８を参照しながら動作について説明する。認識結果信頼度付き重回帰写像モデルに基づく話者適応手段１８０１は、認識結果信頼度演算手段１０１の出力である認識結果信頼度と話者適応学習用音声認識結果２００６と標準パタン２００４とを入力し、数式１の重回帰写像モデルに基づく線形変換によってガウス分布の平均ベクトルを更新する。数式１のＡとｖは数式１２に示されている認識結果信頼度によって重み付けした適応データｏｈ（ｔ）を用いて、数式２２によってＡのｐ行目、ｖのｐ次元目の要素を求める。数式２２においてｏｈ（ｔ，ｐ）は認識結果信頼度によって重み付けした適応データｏｈ（ｔ）のｐ次元目の要素であり、その他の変数に関しては数式２と同一である。また、γ（ｉ，ｔ）は、時刻ｔにガウス分布ｉに音声特徴量ｏ（ｔ）が存在する期待値であるが、重み付けされた音声特徴量ｏｈ（ｔ）が存在する期待値として計算してもよい。
【数２２】

【００８４】
また、認識結果信頼度付き重回帰写像モデルに基づく話者適応手段１８０１は、従来の重回帰写像モデルによる話者適応と同様に数式１、数式２によって平均ベクトルを更新してμａ■（ｑ）を求め、このμａ■（ｑ）を数式８のμｍ（ｑ）として認識結果信頼度によって線形補間する構成としてもよい。
従って、この実施の形態２１の話者適応化装置によれば、このように認識結果信頼度付きの重回帰写像モデルに基づく教師なし話者適応を行うので、話者適応学習用認識結果が誤った場合のパラメータの誤った更新を防ぐことができ、認識率が向上する。
【００８５】
実施の形態２２．
図１９は、請求項２２記載の発明による話者適応化装置の１構成例である実施の形態２２を示すブロック図である。図１９において、実施の形態１、実施の形態１５、及び実施の形態１８と同一の機能ブロックは同一の番号を付し説明を省略する。本発明において特徴的な部分は、認識結果信頼度付き教師なし話者適応手段１０２を、ガウス分布グループ化手段１９０１と、認識結果信頼度付き重回帰写像モデルに基づく話者適応手段１８０１とで構成することである。
【００８６】
次に図１９を参照しながら動作について説明する。ガウス分布グループ化手段１９０１は、標準パタン２００４のガウス分布をクラスタリングによってグループ化し、グループ内のガウス分布の適応データの認識結果信頼度に基づいてグループ毎に実施の形態２１で記述した認識結果信頼度付き重回帰写像モデルに基づく話者適応を行う。
従って、この実施の形態２２の話者適応化装置によれば、このように標準パタンをグループ化して認識結果信頼度付きの重回帰写像モデルに基づいて教師なし話者適応を行うので、話者適応学習用認識結果が誤った場合のパラメータの誤った更新を防ぐことができ、認識率が向上する。
【００８７】
実施の形態２３．
図２０は、請求項２３記載の発明による音声認識装置、すなわち上記実施の形態１〜２２の教師なし話者適応化装置により更新された教師なし話者適応パタン２００８を使用した音声認識装置である実施の形態２３の構成を示すブロック図である。尚、図２０において、図１等に示す話者適応化装置と同じ構成には、同一の番号を付して説明を省略する。
【００８８】
認識辞書２００５によって設定した認識対象の単語［Ｗ（１），Ｗ（２），．．．，Ｗ（ｗｎ）］のテキスト表記から認識ユニットラベルへ変換し、このラベルにしたがって教師なし話者適応パタン２００８を連結し、認識対象単語の標準パタンを作成する。この認識対象単語の標準パタンを用いて、音声特徴量抽出手段２００２の出力である音声特徴量に対して照合を行い、音声認識結果２１０１を出力する。このとき、入力音声２００１は教師なし適応用に用いた単語と同一でも、それ以外の単語でも良い。音声認識結果２１０１は、入力音声２００１に対して認識対象語彙の標準パタン中で最も照合スコア（尤度）が高い単語系列のテキスト表記Ｒｗ＝［Ｗ（ｒ（１）），Ｗ（ｒ（２）），．．．，Ｗ（ｒ（ｍ））］としてを出力される。ここで、ｒ（ｉ）は音声認識結果の単語時系列のｉ番目の単語の認識辞書単語番号を示す。また、ｍは認識単語系列の単語数を示す。
従って、この実施の形態２３の音声認識装置によれば、このように認識結果信頼度付きの教師なし話者適応行って得られた教師なし話者適応パタン２００８を用いて音声認識を行うので、話者適応学習用認識結果が誤った場合のパラメータの誤った更新を防ぐことができ、認識率が向上する。
【００９６】
また次の発明によれば、認識結果信頼度付き教師なし話者適応手段は、前記認識ユニットの標準パタンパラメータ更新用の分割された適応データを用い、最尤推定によって標準パタンのパラメータを推定し、認識ユニットの標準パタンのパラメータ更新に用いた適応データの認識結果信頼度の合計値に基づき、最尤推定前後のパラメータの値の線形補間によって前記標準パタンパラメータから前記話者適応パタンのパラメータへ更新するので、適応学習用音声認識結果が誤った場合でも、標準パタンのパラメータの誤った更新を防ぐことができ認識率が向上する。
【００９８】
また次の発明によれば、認識結果信頼度付き教師なし話者適応手段は、前記認識ユニットの標準パタンのパラメータ更新用の分割された適応データを用い、認識結果信頼度によって適応データのパラメータ学習への重みを計算して、重み付けされた適応データによって前記標準パタンパラメータから前記話者適応パタンのパラメータへ更新するので、適応学習用音声認識結果が誤った場合でも、標準パタンのパラメータの誤った更新を防ぐことができ認識率が向上する。
【０１０２】
また次の発明によれば、認識結果信頼度付き教師なし話者適応手段は、前記第１の発声の認識結果信頼度の値によって更新方法を切り替えるので、適応学習用音声認識結果が誤った場合でも、標準パタンのパラメータの誤った更新を防ぐことができ認識率が向上する。
【０１０３】
また次の発明によれば、標準パタンのパラメータは、クラスタリングによってグループ化し、グループ内のパラメータの更新用の分割された適応データと認識結果信頼度を用いてグループに共通なパラメータの変動量を演算し、前記パラメータ変動量によって前記標準パタンのグループのパラメータを前記話者適応パタンのグループのパラメータへ更新するので、適応学習用音声認識結果が誤った場合でも、標準パタンのパラメータの誤った更新を防ぐことができ認識率が向上する。
【０１０４】
また次の発明によれば、クラスタリングは、木構造クラスタリングを行って木構造状に標準パタンのパラメータをクラスタリングし、木構造のノード以下に属する標準パタンのパラメータ更新用の分割された適応データの認識結果信頼度が閾値以上であるノード以下の標準パタンのパラメータをグループとして、グループ内のパラメータの更新用の分割された適応データと認識結果信頼度を用いてグループに共通なパラメータの変動量を演算し、前記変動量によって前記標準パタンのグループのパラメータを前記話者適応パタンのグループのパラメータへ更新するので、適応学習用音声認識結果が誤った場合でも、標準パタンのパラメータの誤った更新を防ぐことができ認識率が向上する。
【０１０５】
また次の発明によれば、標準パタン、及び前記話者適応パタンとして、連続混合分布型隠れマルコフモデルを用いるので、適応学習用音声認識結果が誤った場合でも、標準パタンのパラメータの誤った更新を防ぐことができ認識率が向上する。
【０１０６】
また次の発明によれば、連続混合分布型隠れマルコフモデルのシンボル出力確率密度関数を構成する要素分布関数は、ガウス分布であるので、適応学習用音声認識結果が誤った場合でも、標準パタンのパラメータの誤った更新を防ぐことができ認識率が向上する。
【０１０７】
また次の発明によれば、認識結果信頼度付き教師なし話者適応手段において更新するパラメータは前記ガウス分布の平均ベクトルであるので、適応学習用音声認識結果が誤った場合でも、標準パタンのパラメータの誤った更新を防ぐことができ認識率が向上する。
【０１０８】
また次の発明によれば、ガウス分布の平均ベクトルの更新は、適応データが存在するガウス分布の平均ベクトルは認識結果信頼度付き更新を行い、適応データが存在しないガウス分布の平均ベクトルは適応データが存在するガウス分布の更新前後の平均ベクトルの値の差分ベクトルを用いた補間によって前記標準パタンのパラメータを前記話者適応パタンのパラメータへ更新するので、適応学習用音声認識結果が誤った場合でも、標準パタンのパラメータの誤った更新を防ぐことができ認識率が向上する。
【０１０９】
また次の発明によれば、認識結果信頼度付き教師なし話者適応手段は、認識結果信頼度を用いた重回帰写像モデルに基づく話者適応によって、前記標準パタンのパラメータであるガウス分布の平均ベクトルを前記話者適応パタンのガウス分布の平均ベクトルへ更新するので、適応学習用音声認識結果が誤った場合でも、標準パタンのパラメータの誤った更新を防ぐことができ認識率が向上する。
【０１１０】
また次の発明によれば、重回帰写像モデルに基づく話者適応は、標準パタンのガウス分布をクラスタリングしてグループ化し、グループ内のガウス分布更新用の適応データと認識結果信頼度に基づいてガウス分布のグループに１つの回帰係数を演算し、標準パタンの平均ベクトルを回帰係数を用いて話者適応パタンの平均ベクトルへ更新するので、適応学習用音声認識結果が誤った場合でも、標準パタンのパラメータの誤った更新を防ぐことができ認識率が向上する。
【０１１１】
また次の発明によれば、請求項１〜２２のうちいずれかに記載の話者適応化装置によって更新された教師なし話者適応パタンと、話者の入力音声から音声特徴量を抽出する音声特徴量抽出手段と、前記音声特徴量抽出手段が抽出した音声特徴量と前記教師なし話者適応パタンとを照合して認識結果を出力する照合手段と、を備えたので、適応学習用音声認識結果が誤った場合でも、標準パタンのパラメータの誤った更新を防ぐことができ認識率が向上する。
【図面の簡単な説明】
【図１】この発明による話者適応化装置の実施の形態１の構成を示すブロック図である。
【図２】この発明による話者適応化装置の実施の形態２の構成を示すブロック図である。
【図３】この発明による話者適応化装置の実施の形態３の動作説明図である。
【図４】この発明による話者適応化装置の実施の形態４の動作説明図である。
【図５】この発明による話者適応化装置の実施の形態５の動作説明図である。
【図６】この発明による話者適応化装置の実施の形態６の動作説明図である。
【図７】この発明による話者適応化装置の実施の形態７の構成を示すブロック図である。
【図８】この発明による話者適応化装置の実施の形態８の構成を示すブロック図である。
【図９】この発明による話者適応化装置の実施の形態１０の構成を示すブロック図である。
【図１０】この発明による話者適応化装置の実施の形態１２の構成を示すブロック図である。
【図１１】この発明による話者適応化装置の実施の形態１３の構成を示すブロック図である。
【図１２】この発明による話者適応化装置の実施の形態１４の構成を示すブロック図である。
【図１３】この発明による話者適応化装置の実施の形態１５の構成を示すブロック図である。
【図１４】この発明による話者適応化装置の実施の形態１６の構成を示すブロック図である。
【図１５】この発明による話者適応化装置の実施の形態１６の動作説明図である。
【図１６】この発明による話者適応化装置の実施の形態２０の構成を示すブロック図である。
【図１７】この発明による話者適応化装置の実施の形態２０の動作説明図である。
【図１８】この発明による話者適応化装置の実施の形態２１の構成を示すブロック図である。
【図１９】この発明による話者適応化装置の実施の形態２２の構成を示すブロック図である。
【図２０】この発明による音声認識装置の実施の形態２３の構成を示すブロック図である。
【図２１】従来の話者適応化装置の構成を示すブロック図である。
【符号の説明】
１０１認識結果信頼度演算手段
１０２認識結果信頼度付き教師なし話者適応手段
７０１音声データセグメンテーション手段
７０２認識結果信頼度付き標準パタンパラメータ更新手段
８０１標準パタンパラメータ最尤推定手段
８０２認識結果信頼度に基づくパラメータ線形補間手段
９０１認識結果信頼度重み付き学習データによる適応学習手段
１００１複数認識結果候補出力照合手段
１００２複数認識結果候補信頼度演算手段
１００３複数認識結果候補信頼度付き教師なし話者適応手段
１１０１認識結果信頼度比較手段
１２０１認識結果信頼度による話者適応方式選択手段
１２０２−１〜Ｍ認識結果信頼度付き教師なし話者適応手段１〜Ｍ
１３０１標準パタンパラメータクラスタリング手段
１３０２認識結果信頼度付きパラメータグループ教師なし話者適応手段
１４０１標準パタンパラメータ木構造クラスタリング手段
１４０２木構造化パラメータに基づく標準パタンパラメータグループ化手段
１６０１標準パタンパラメータ補間手段
１８０１認識結果信頼度付き重回帰写像モデルに基づく話者適応手段
１９０１ガウス分布グループ化手段
２００１入力音声
２００２音声特徴量抽出手段
２００３照合手段
２００４標準パタン
２００５認識辞書
２００６話者適応学習用音声認識結果
２００７教師なし話者適応手段
２００８教師なし話者適応パタン
２１０１音声認識結果
[0001]
[Technical field of the invention]
The present invention relates to an unsupervised speaker adaptation device that updates a standard pattern, whose parameters were trained on speech data from many speakers, into a speaker adaptation pattern adapted to a particular speaker, and to a speech recognition device that uses that speaker adaptation pattern.
[0002]
[Prior art]
For speech recognition applications, there is strong demand for speaker-independent speech recognition systems that do not require the speaker's voice to be registered in advance, and practical use of speech recognition methods based on hidden Markov models (Hidden Markov Model, hereinafter HMM) and neural networks (Neural Network, hereinafter NN) is being studied. Details of HMMs and NNs are given in, for example, "Fundamentals of Speech Recognition (Vols. 1 and 2)", L. Rabiner and B. H. Juang, translation supervised by Furui, November 1995, NTT Advanced Technology (hereinafter Reference 1). In these methods, a speaker-independent standard pattern is created by training standard patterns in advance on speech data, such as words and sentences, from a large number of speakers.
[0003]
However, when use is restricted to a specific speaker, a speaker-independent speech recognition system based on HMMs or NNs currently has a word error rate about two to three times that of a speaker-dependent recognition system whose standard patterns were trained on speech data, such as words and sentences, from that speaker. Speaker adaptation techniques have therefore been studied actively in recent years in order to improve speaker-independent speech recognition systems.
[0004]
Speaker adaptation techniques use a small amount of speech data from a specific speaker (hereinafter, adaptation data) to adaptively train the standard-pattern parameters of a speaker-independent speech recognition system, before or during use of the system, in order to improve the recognition rate. Speaker adaptation methods are described in detail in "Speaker Adaptation in Speech Recognition", Hiroshi Matsumoto, Proceedings of the 1995 Spring Meeting of the Acoustical Society of Japan, pp. 27-30, March 1995 (hereinafter Reference 2). There are two kinds of speaker adaptation, supervised and unsupervised, depending on whether the adaptation data consists of speech whose content is known or of arbitrary speech whose content is unknown. Supervised speaker adaptation gives high recognition accuracy after adaptive learning, but the user must utter predetermined words or sentences before using the speech recognition device, which places a heavy burden on the user. Unsupervised speaker adaptation, by contrast, aims to improve the recognition rate while the device is in use, without the user being aware of the adaptive learning. For practical speech recognition applications, the establishment of unsupervised speaker adaptation is desired.
[0005]
In conventional unsupervised adaptation, the input speech is matched against the speaker-independent standard patterns, the recognition result obtained from the matching is taken to be the content of the utterance, the speaker-independent standard patterns are concatenated accordingly, and the standard-pattern parameters are updated using the input speech as adaptation data. This is reported in, for example, "Speaker Adaptation of Continuous Density HMMs Using Multivariate Linear Regression", C. L. Leggetter and P. C. Woodland, Proc. of ICSLP 94, pp. 451-454, 1994 (hereinafter Reference 3).
[0006]
An unsupervised speaker adaptation device that treats the recognition result as the utterance content, as described in Reference 3, is explained below as a conventional example with reference to the block diagram of FIG. 21. In FIG. 21, input speech 2001 is the speech of a word or sentence uttered by the speaker using the recognition device. Here, one utterance is taken to be a phrase or sentence between one pause and the next.
[0007]
The speech feature extraction means 2002 A/D-converts the signal of the input speech 2001, cuts the converted signal into frames at a fixed interval of about 5 to 20 milliseconds, and performs acoustic analysis to extract speech features. A speech feature here is one that can represent the characteristics of speech with a small amount of information, for example a feature vector made up of the cepstrum and the dynamic features of the cepstrum.
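The framing step can be sketched as follows. This is a minimal illustration that assumes non-overlapping frames and drops any trailing partial frame; the text only specifies a fixed interval of about 5 to 20 milliseconds, and the function name and the 10 ms default are hypothetical. The acoustic analysis applied to each frame is not shown.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10.0):
    """Cut a sampled signal into consecutive fixed-length frames."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


# 50 samples at 1 kHz with 10 ms frames gives 5 frames of 10 samples each.
frames = split_into_frames(list(range(50)), sample_rate_hz=1000, frame_ms=10.0)
```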
[0008]
In the matching means 2003, the recognition target words [W(1), W(2), ..., W(wn)] (the number in parentheses is the word number; wn is the number of recognition target words), which are set in text form in the recognition dictionary 2005, are converted into recognition-unit labels, and the standard patterns 2004 of the recognition units corresponding to the labels are concatenated to create standard patterns for the recognition target words. Matching is then performed against the time series of speech features O = [o(1), o(2), ..., o(T)] (the number in parentheses is the time; T is the maximum number of frames) for utterances 1 to N output by the speech feature extraction means 2002, and the speech recognition result 2006 for speaker adaptation learning is output. For this result, the word number sequence Rn■ = [r■(1), r■(2), ..., r■(m■)] with the highest matching score (also called likelihood) for the utterance is computed, and the text form of the corresponding words, Rw■ = [W(r■(1)), W(r■(2)), ..., W(r■(m■))], is output. Here r■(i) is the word number of the i-th word in the word sequence of the speech recognition result 2006 for speaker adaptation learning, and m■ is the number of words in that sequence.
[0009]
The standard pattern 2004 is prepared in advance. Reference 3 uses, as the initial standard pattern, one obtained by parameter learning on speech data from a large number of speakers, using HMMs whose recognition unit is a phoneme that depends on the preceding and succeeding phoneme environment (context). The HMM forms the standard patterns of the recognition units by holding the following information as parameters for each state.
(A) State number
(B) Acceptable context class
(C) List of preceding and succeeding states
(D) Parameters of output probability density distribution
(E) Self-transition probability and transition probability to the subsequent state
[0010]
The recognition dictionary 2005 stores, as text, the predetermined words or sentences to be recognized. The text notation is converted into a recognition unit label sequence, and the corresponding recognition unit standard patterns from the standard pattern 2004 are linked according to that label sequence to generate the standard pattern of each recognition target word used by the matching means 2003. For example, if "Ao" is in the recognition dictionary 2005, its phoneme sequence is /ao/. The standard pattern used to recognize "Ao" as an isolated utterance is formed by connecting two HMMs: the recognition unit HMM λ-ao, whose central phoneme is /a/, preceding phoneme is silence, and succeeding phoneme is /o/; and the recognition unit HMM λao-, whose central phoneme is /o/, preceding phoneme is /a/, and succeeding phoneme is silence. Speech recognition systems with a recognition vocabulary of 40,000 words or more have recently been studied using such phoneme HMMs that depend on the surrounding phoneme environment.
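The conversion of a phoneme sequence into context-dependent recognition units, as in the "Ao" example, can be sketched as follows. The function name, the tuple representation, and the silence label "sil" are hypothetical choices made for illustration; the specification does not define a label format.

```python
def context_dependent_units(phonemes, silence="sil"):
    """Return (preceding, central, succeeding) triples for an isolated word.

    Assumption: the word is uttered in isolation, so silence surrounds it.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


units = context_dependent_units(["a", "o"])
# units == [("sil", "a", "o"), ("a", "o", "sil")]
```

The two triples correspond to the two HMMs that are connected to form the word pattern for "Ao".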
[0011]
The unsupervised speaker adaptation means 2007 receives the speech recognition result 2006 for speaker adaptation learning output from the matching means 2003 and the standard pattern 2004, and connects the phoneme HMMs of the standard pattern 2004 according to the recognition unit label sequence of the recognition result. It then updates the parameters of the standard pattern, using the time series of speech features output from the speech feature extraction means 2002 as adaptation data, and outputs an unsupervised speaker adaptation pattern 2008.
[0012]
In Reference 3, the unsupervised speaker adaptation pattern 2008 is calculated by linearly transforming the mean vectors of the Gaussian distributions, which are among the parameters of the HMM, based on the multiple regression mapping model of Expression 1. In Expression 1, μ(q) and μa(q) are the mean vectors of Gaussian distribution number q before and after updating; their dimension d equals that of the speech feature vector. A is a d × d transformation matrix, and v is a d-dimensional constant term vector. The p-th row of A and the p-th element of v are calculated by Expression 2. In Expression 2, Ψ is the set of Gaussian distribution numbers to be updated, r(i, t) is the expected value that the feature vector o(t) belongs to Gaussian distribution i at time t, μ(i, r) is the r-th element of the mean vector of Gaussian distribution i, σ2(i, p) is the element in the p-th row and p-th column of the covariance matrix of Gaussian distribution i, o(t, p) is the p-th element of the feature vector o(t), and T is the total number of frames of adaptation data.
(Equation 1)

(Equation 2)
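As an illustration of the mean vector transformation, the following sketch applies a d × d matrix A and a d-dimensional vector v to every Gaussian mean. It assumes that Expression 1 has the affine form μa(q) = A μ(q) + v, which is inferred from the symbol definitions above; the estimation of A and v by Expression 2 is not shown, and all function names and numeric values are hypothetical.

```python
def transform_mean(A, v, mu):
    """Return A * mu + v for a d x d matrix A and d-dimensional v and mu."""
    d = len(mu)
    if len(A) != d or len(v) != d or any(len(row) != d for row in A):
        raise ValueError("dimension mismatch")
    return [sum(A[p][r] * mu[r] for r in range(d)) + v[p] for p in range(d)]


def adapt_means(A, v, means):
    """Apply the same transform to every Gaussian mean, keyed by number q."""
    return {q: transform_mean(A, v, mu) for q, mu in means.items()}


A = [[1.0, 0.5], [0.0, 2.0]]            # hypothetical transformation matrix
v = [0.1, -0.2]                         # hypothetical constant term vector
means = {0: [1.0, 2.0], 1: [0.0, 0.0]}  # mean vectors before updating
adapted = adapt_means(A, v, means)      # mean vectors after updating
```

Because one pair (A, v) is shared by all Gaussian distributions in the set to be updated, even distributions that received little adaptation data are moved by the transform.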

[0013]
The unsupervised speaker adaptation pattern 2008 is an output from the unsupervised speaker adaptation means 2007, and speech recognition is performed by a speech recognition device or the like using this standard pattern.
[0014]
[Problems to be solved by the invention]
However, the conventional unsupervised speaker adaptation apparatus treats the recognition result for speaker adaptation obtained by matching as the utterance content when updating the parameters of the standard pattern. When that recognition result is incorrect, the parameters are erroneously estimated and the recognition rate falls.
[0015]
The present invention therefore solves the above problem. Its object is to provide a speaker adaptation apparatus that, in unsupervised speaker adaptation, prevents erroneous estimation of the standard pattern parameters and improves the recognition rate even when the recognition result for speaker adaptation learning is incorrect, and to provide a speech recognition apparatus that performs speech recognition using the unsupervised speaker adaptation pattern updated by that speaker adaptation apparatus.
[0016]
[Means for Solving the Problems]
The speaker adaptation apparatus according to the present invention compares a speech feature extracted from a speaker's input speech with a standard pattern obtained by parameter learning on speech data of a large number of speakers, outputs a recognition result, and determines, according to the reliability of the recognition result, whether to update the standard pattern to a speaker adaptation pattern adapted to the speaker who produced the speech. The apparatus comprises:
standard pattern parameter maximum likelihood estimating means for calculating an estimated adaptive pattern from the speech features and the standard pattern by maximum likelihood estimation; and
parameter linear interpolation means for calculating the speaker adaptation pattern by linearly interpolating, according to the reliability, between the values of the parameters constituting the estimated adaptive pattern calculated by the standard pattern parameter maximum likelihood estimating means and the values of the parameters constituting the standard pattern.
[0017]
Further, the speaker adaptation apparatus according to the next invention compares a speech feature extracted from a speaker's input speech with a standard pattern obtained by parameter learning on speech data of a large number of speakers, outputs a recognition result, and determines, according to the reliability of the recognition result, whether to update the standard pattern to a speaker adaptation pattern adapted to the speaker who produced the speech.
The apparatus comprises adaptive learning means that calculates, based on the reliability, a weight for the parameter learning of the speaker adaptation data obtained from the input speech and, using the weighted speaker adaptation data, updates the parameters constituting the standard pattern to the parameters constituting the speaker adaptation pattern.
[0018]
Further, the speaker adaptation apparatus according to the next invention compares a speech feature extracted from a speaker's input speech with a standard pattern obtained by parameter learning on speech data of a large number of speakers, outputs a recognition result, and determines, according to the reliability of the recognition result, whether to update the standard pattern to a speaker adaptation pattern adapted to the speaker who produced the speech.
The apparatus comprises speaker adaptation method selecting means that selects a different speaker adaptation learning algorithm based on the values of the reliability of recognition results output in the past.
[0039]
[Best Mode for Carrying Out the Invention]
Embodiment 1.
FIG. 1 is a block diagram showing Embodiment 1, one configuration example of the speaker adaptation apparatus according to the first aspect of the present invention. In FIG. 1, functional blocks identical to those in FIG. 21, which illustrates the related art, carry the same reference numerals, and their description is omitted. The features that distinguish the present invention from the prior art are that a recognition result reliability calculation means 101 is provided and that an unsupervised speaker adaptation means 102 with recognition result reliability is provided in place of the unsupervised speaker adaptation means 2007.
[0040]
Next, the operation will be described with reference to FIG. 1. The recognition result reliability calculation means 101 receives the speech recognition result 2006 for speaker adaptation learning output from the matching means 2003, the speech features output from the speech feature extraction means 2002, and the standard pattern 2004, and calculates the reliability of the speech recognition result 2006 for speaker adaptation learning. The reliability of a recognition result is described in, for example, “Study of Word Rejection Method Using Various Statistics”, Hanazawa, Abe, Proc. 141-142, March 1998 (hereinafter referred to as Reference 4).
[0041]
In Reference 4, three kinds of statistics, (1) acoustic likelihood difference, (2) phoneme duration, and (3) phoneme confusion matrix are used to obtain the reliability of the recognition result.
The acoustic likelihood difference of (1) is calculated by Equation 3 as a reliability. It is the difference, over the section of the recognition result Rw ■, between the frame likelihood of Rw ■, which is the speech recognition result 2006 for speaker adaptation learning of the input speech, and the frame likelihood of a speech recognizer that uses a phoneme typewriter allowing all phoneme connections. In Expression 3, lt is the log frame likelihood of the recognition result Rw ■ at frame t, and Lt is the log frame likelihood given by the phoneme typewriter. N is the number of phonemes in Rw ■, and bi and ei are the start and end frames of the i-th phoneme. Because a smaller value of Sa indicates higher reliability, Sa is usually multiplied by −1 when it is used as a reliability.
(Equation 3)

[0042]
The phoneme duration of (2) is a reliability statistic based on how consistent the duration of each phoneme in the speech recognition result Rw ■ for speaker adaptation learning of the input speech is with the durations of its adjacent phonemes; it is calculated by Equation 4. In Equation 4, di is a three-dimensional vector of the durations of the i-th phoneme of Rw ■ and the phonemes immediately before and after it, and Di is a three-dimensional vector of the average durations of the same three phonemes, obtained in advance from speech data of many other speakers. Sd calculated by Expression 4 takes a larger value the closer the ratio of the durations of the three adjacent phonemes in the recognition result Rw ■ is to the ratio of the average durations obtained from the learning data. Therefore, the larger the value of Sd, the higher the reliability of the recognition result.
(Equation 4)

[0043]
The phoneme confusion matrix of (3) is used as follows. Phoneme recognition by a phoneme typewriter is performed in parallel, the phoneme sequence constituting the speech recognition result Rw ■ for speaker adaptation learning and the phoneme sequence recognized by the phoneme typewriter are aligned on the time axis, and the reliability is calculated by Equation 5 using a phoneme confusion matrix obtained in advance. In Equation 5, hi is the i-th phoneme model constituting Rw ■, pik is a phoneme in the phoneme typewriter's sequence whose section overlaps that of hi, Ki is the number of phonemes whose sections overlap that of hi, m(h, p) is the confusion rate between phoneme h and phoneme p obtained in advance, and wik is the section overlap rate between hi and pik, calculated by Equation 6. The larger the value of Sc in Expression 5, the higher the reliability of the recognition result. The final recognition result reliability of Rw ■ is calculated by Expression 7 from the three statistics above. In Equation 7, w2 and w3 are weighting coefficients that are set experimentally.
(Equation 5)

(Equation 6)

(Equation 7)
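One way the three statistics could be combined is sketched below. The form S = −Sa + w2·Sd + w3·Sc is an assumption: it follows only from the statements that Sa is sign-inverted before use and that w2 and w3 are weighting coefficients, and it is not confirmed as the exact form of Equation 7. The function name and numeric values are hypothetical.

```python
def combined_reliability(sa, sd, sc, w2, w3):
    """Combine the three statistics into one recognition result reliability.

    Assumed form: S = -Sa + w2 * Sd + w3 * Sc.
    Sa is negated because a smaller Sa indicates a more reliable result,
    whereas larger Sd and Sc indicate a more reliable result.
    """
    return -sa + w2 * sd + w3 * sc


# Hypothetical statistics for one recognition result.
score = combined_reliability(sa=2.0, sd=0.5, sc=0.25, w2=2.0, w3=4.0)
```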

[0044]
The unsupervised speaker adaptation means 102 with recognition result reliability receives the recognition result reliability output from the recognition result reliability calculation means 101, the speech recognition result 2006 for speaker adaptation learning output from the matching means 2003, the speech features output from the speech feature extraction means 2002, and the standard pattern 2004. It updates the parameters of the standard pattern and outputs an unsupervised speaker adaptation pattern 2008.
Therefore, according to the speaker adaptation apparatus of the first embodiment, unsupervised speaker adaptation takes the reliability of the recognition result into account as described above, so erroneous updating of the standard pattern parameters is prevented even if the recognition result is incorrect, and the recognition rate can be improved.
[0045]
Embodiment 2.
FIG. 2 is a block diagram showing Embodiment 2, one configuration example of the speaker adaptation apparatus according to the second aspect of the present invention. In FIG. 2, functional blocks identical to those in the first embodiment carry the same reference numerals, and their description is omitted. The characteristic part of this embodiment is that the unsupervised speaker adaptation pattern 2008 updated with the preceding utterance is substituted for the standard pattern 2004, and unsupervised speaker adaptation is then performed on the subsequent utterance.
[0046]
Next, the operation will be described with reference to FIG. 2. The unsupervised speaker adaptation means 102 with recognition result reliability receives the user's first utterance O(1) = [o(t1), o(t1 + 1), ..., o(T1)], updates the parameters of the standard pattern 2004, and outputs the unsupervised speaker adaptation pattern 2008. Let Λ(1) denote the unsupervised speaker adaptation pattern obtained from the first utterance. Next, Λ(1) is set as the standard pattern 2004, and using the user's second utterance O(2) = [o(t2), o(t2 + 1), ..., o(T2)], the standard pattern 2004 is updated further by unsupervised speaker adaptation processing to calculate the unsupervised speaker adaptation pattern 2008. In this way, Λ(j−1), updated sequentially up to the (j−1)-th utterance, is used as the standard pattern before the unsupervised speaker adaptation update with the j-th utterance.
Therefore, according to the speaker adaptation apparatus of the second embodiment, unsupervised speaker adaptation is performed sequentially while taking the reliability of the recognition result into account as described above, so erroneous updating of the standard pattern parameters is prevented even if the recognition result is incorrect, and the recognition rate can be improved.
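The sequential use of each adapted pattern as the starting point for the next utterance can be sketched as a simple loop. Here `adapt_one` stands in for the unsupervised speaker adaptation means 102; it and the toy scalar "pattern" are hypothetical placeholders used only to make the control flow runnable.

```python
def adapt_sequentially(initial_pattern, utterances, adapt_one):
    """Feed each adapted pattern back in as the pattern for the next utterance.

    Returns the final pattern and the list [pattern after utterance 1, ...].
    """
    pattern = initial_pattern
    history = []
    for features in utterances:
        pattern = adapt_one(pattern, features)  # pattern(j-1) -> pattern(j)
        history.append(pattern)
    return pattern, history


def toy_adapt(pattern, features):
    """Toy stand-in: move a scalar mean halfway toward the utterance mean."""
    return 0.5 * pattern + 0.5 * (sum(features) / len(features))


final, history = adapt_sequentially(0.0, [[2.0, 2.0], [4.0]], toy_adapt)
# history == [1.0, 2.5]; final == 2.5
```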
[0047]
Embodiment 3.
FIG. 3 is a diagram explaining the operation of the recognition result reliability calculating means of the speaker adaptation apparatus according to the third aspect of the present invention, and shows the features of the third embodiment. The characteristic part of the third embodiment is that one recognition result reliability, output from the recognition result reliability calculation means 101, is calculated for each utterance delimited by pauses. When the start and end of the k-th utterance are tus(k) and tue(k), the recognition result reliability calculating means 101 calculates one recognition result reliability Su(k) for the frames between tus(k) and tue(k), as shown in FIG. 3, and sets the recognition result reliability of every frame between tus(k) and tue(k) to Su(k).
Therefore, according to the speaker adaptation apparatus of the third embodiment, unsupervised speaker adaptation takes the reliability of the recognition result of each utterance into account as described above, so erroneous updating of the standard pattern parameters is prevented even if the recognition result is incorrect, and the recognition rate can be improved.
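Assigning one reliability value to every frame of a section can be sketched as follows. The same helper applies to any section boundaries, whether utterances or smaller units. The function name, the 0-based inclusive frame indices, and the default value given to frames outside every section (for example, pauses) are assumptions made for illustration.

```python
def frame_reliabilities(sections, num_frames, default=0.0):
    """Expand per-section reliabilities into one value per frame.

    sections: list of (start_frame, end_frame, reliability), with 0-based
    inclusive frame indices. Frames outside every section keep `default`.
    """
    per_frame = [default] * num_frames
    for start, end, reliability in sections:
        if not 0 <= start <= end < num_frames:
            raise ValueError("section out of range")
        for t in range(start, end + 1):
            per_frame[t] = reliability
    return per_frame


# Two utterances with reliabilities 0.9 and 0.4, separated by a pause.
su = frame_reliabilities([(0, 2, 0.9), (5, 6, 0.4)], num_frames=8)
# su == [0.9, 0.9, 0.9, 0.0, 0.0, 0.4, 0.4, 0.0]
```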
[0048]
Embodiment 4.
FIG. 4 is a diagram explaining the operation of the recognition result reliability calculating means of the speaker adaptation apparatus according to the fourth aspect of the present invention, and shows the features of the fourth embodiment. The characteristic part of the fourth embodiment is that one recognition result reliability, output from the recognition result reliability calculation means 101, is calculated for each recognition unit. The recognition unit is the basic unit of a standard pattern; the standard pattern for recognizing a recognition target word or sentence is formed by connecting recognition units. Based on the speech recognition result 2006 for speaker adaptation learning of the input speech, the recognition result reliability calculation means 101 connects the standard patterns according to the recognition unit label sequence and uses them to divide the time series of speech features into recognition units. When the start and end of the u-th recognition unit after division are trs(u) and tre(u), one recognition result reliability Sr(u) is calculated for the frames between trs(u) and tre(u), as shown in FIG. 4, and the recognition result reliability of every frame in that section is set to Sr(u). FIG. 4 shows an example in which the recognition result consists of five recognition units.
Therefore, according to the speaker adaptation apparatus of the fourth embodiment, unsupervised speaker adaptation takes the reliability of the recognition result of each recognition unit into account as described above, so erroneous updating of the standard pattern parameters is prevented even if the recognition result is incorrect, and the recognition rate can be improved.
[0049]
Embodiment 5.
FIG. 5 is a diagram explaining the operation of the recognition result reliability calculating means of the speaker adaptation apparatus according to the fifth aspect of the present invention, and shows the features of the fifth embodiment. The characteristic part of the fifth embodiment is that one recognition result reliability, output from the recognition result reliability calculation means 101, is calculated for each speech unit such as a phoneme or a syllable. The case where the speech unit is a phoneme is described below. The recognition result reliability calculation means 101 divides the time series of speech features into phoneme units according to the phoneme sequence of the speech recognition result 2006 for speaker adaptation learning of the input speech. When the start and end of the p-th phoneme after division are tps(p) and tpe(p), one recognition result reliability Sp(p) is calculated for the frames between tps(p) and tpe(p), as shown in FIG. 5, and the recognition result reliability of every frame between tps(p) and tpe(p) is set to Sp(p). FIG. 5 shows an example in which the recognition result for speaker adaptation learning of the input speech consists of the five phonemes of /onsei/.
Therefore, according to the speaker adaptation apparatus of the fifth embodiment, unsupervised speaker adaptation takes the reliability of the recognition result of each phoneme into account as described above, so erroneous updating of the standard pattern parameters is prevented even if the recognition result is incorrect, and the recognition rate can be improved.
[0050]
Embodiment 6.
FIG. 6 is a diagram explaining the operation of the recognition result reliability calculating means of the speaker adaptation apparatus according to the sixth aspect of the present invention, and shows the features of the sixth embodiment. The characteristic part of the sixth embodiment is that the recognition result reliability output from the recognition result reliability calculation means 101 is calculated for each frame at a fixed time interval. The operation will be described below with reference to FIG. 6. The recognition result reliability calculation means 101 outputs the recognition result reliability of the input speech for each frame at a fixed time interval of about 5 to 20 milliseconds. FIG. 6 shows the recognition result reliabilities [Sf(t), Sf(t + 1), ..., Sf(t + 5)].
Therefore, according to the speaker adaptation apparatus of the sixth embodiment, the recognition result reliability is calculated for each frame at a fixed time interval, so erroneous updating of the standard pattern parameters is prevented even if the recognition result is incorrect, and the recognition rate can be improved.
[0051]
Embodiment 7.
FIG. 7 is a block diagram showing Embodiment 7, one configuration example of the speaker adaptation apparatus according to the seventh aspect of the present invention. In FIG. 7, functional blocks identical to those in the first embodiment carry the same reference numerals, and their description is omitted. The characteristic part of this embodiment is that the unsupervised speaker adaptation means 102 with recognition result reliability comprises a speech data segmentation means 701 and a standard pattern parameter updating means 702 with recognition result reliability.
[0052]
Next, the operation will be described with reference to FIG. 7. The speech data segmentation means 701 connects the corresponding recognition unit standard patterns from the standard pattern 2004 based on the speech recognition result 2006 for speaker adaptation learning, and segments the time series of speech features into recognition units. When the standard pattern is an HMM, the segmentation is performed by, for example, the Viterbi algorithm described in Reference 1. The Viterbi algorithm finds the single optimal state sequence [q1, q2, ..., qt] for a time series of speech features [o(1), o(2), ..., o(t)]. For example, suppose that a word standard pattern consists of three recognition units, that each recognition unit is a one-state HMM, and that the states are (s1, s2, s3). If the optimal state sequence obtained by the Viterbi algorithm is [s1, s1, s2, s2, s2, s3, s3, s3], frames 1 to 2 are segmented into unit 1, frames 3 to 5 into unit 2, and frames 6 to 8 into unit 3.
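The step from an optimal state sequence to unit boundaries in the example above can be sketched as follows. The sketch does not run the Viterbi algorithm itself; it only converts a given state path into sections, and it assumes, as in the example, that each recognition unit has exactly one state and that a state is not revisited. The function name is hypothetical.

```python
from itertools import groupby


def segment_by_state(state_path):
    """Return (state, first_frame, last_frame) runs, frames numbered from 1."""
    segments = []
    frame = 1
    for state, run in groupby(state_path):
        length = len(list(run))
        segments.append((state, frame, frame + length - 1))
        frame += length
    return segments


path = ["s1", "s1", "s2", "s2", "s2", "s3", "s3", "s3"]
segments = segment_by_state(path)
# segments == [("s1", 1, 2), ("s2", 3, 5), ("s3", 6, 8)]
```

The speech features of frames 1 to 2, 3 to 5, and 6 to 8 would then be used to update the parameters of units 1, 2, and 3 respectively.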
[0053]
The standard pattern parameter updating means 702 with recognition result reliability updates the standard pattern parameters of each recognition unit using the speech features divided by the segmentation and the recognition result reliability.
Therefore, according to the speaker adaptation apparatus of the seventh embodiment, the speech data is segmented and the standard pattern parameters are learned with the recognition result reliability as described above, so erroneous updating of the standard pattern parameters is prevented even if the recognition result for speaker adaptation learning is incorrect, and the recognition rate can be improved.
[0054]
Embodiment 8.
FIG. 8 is a block diagram showing Embodiment 8, one configuration example of the speaker adaptation apparatus according to the eighth aspect of the present invention. In FIG. 8, functional blocks identical to those in the first embodiment carry the same reference numerals, and their description is omitted. The characteristic part of this embodiment is that the unsupervised speaker adaptation means 102 with recognition result reliability comprises a standard pattern parameter maximum likelihood estimation means 801 and a parameter linear interpolation means 802 based on recognition result reliability.
[0055]
Next, the operation will be described with reference to FIG. 8. The standard pattern parameter maximum likelihood estimating means 801 performs maximum likelihood estimation of the standard pattern parameters to obtain the estimated standard pattern Λm. It uses the speech features output from the speech feature extraction means 2002 and the standard pattern formed by connecting the recognition unit standard patterns of the standard pattern 2004 according to the speech recognition result 2006 for speaker adaptation learning. The maximum likelihood estimation is performed by, for example, the Baum-Welch method described in Reference 1.
[0056]
The parameter linear interpolation means 802 based on recognition result reliability receives the standard pattern Λm after maximum likelihood estimation, which is output from the standard pattern parameter maximum likelihood estimation means 801, and the standard pattern Λ before estimation. It linearly interpolates the parameters of Λm and Λ according to the recognition result reliability output from the recognition result reliability calculating means 101, and uses the resulting values as the parameters of the unsupervised speaker adaptation pattern 2008. For example, when the standard pattern is an HMM and the mean vector μ(q) of a Gaussian distribution is updated (q is the number of the Gaussian distribution), the mean vector μa(q) of the unsupervised speaker adaptation pattern 2008 is calculated by Expression 8. In Expression 8, μ(q) and μm(q) are the mean vectors before and after the maximum likelihood estimation. wm(q) is a weighting coefficient with a value from 0 to 1.0, determined from the recognition result reliability of the adaptation data used for updating μ(q).
(Equation 8)
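The linear interpolation of a mean vector can be sketched as follows, assuming that Expression 8 has the form μa(q) = wm(q) μm(q) + (1 − wm(q)) μ(q), so that a larger weight moves the result toward the maximum likelihood estimate. That form is inferred from the description of wm(q) and is not confirmed here; the function name and values are hypothetical.

```python
def interpolate_mean(mu, mu_m, w_m):
    """Interpolate between the mean before (mu) and after (mu_m) ML estimation.

    Assumed form: mu_a = w_m * mu_m + (1 - w_m) * mu, with 0 <= w_m <= 1.
    """
    if not 0.0 <= w_m <= 1.0:
        raise ValueError("w_m must lie in [0, 1]")
    if len(mu) != len(mu_m):
        raise ValueError("dimension mismatch")
    return [w_m * m + (1.0 - w_m) * b for b, m in zip(mu, mu_m)]


mu_a = interpolate_mean(mu=[0.0, 2.0], mu_m=[4.0, 6.0], w_m=0.25)
# mu_a == [1.0, 3.0]
```

With w_m = 0 the standard pattern is left unchanged, and with w_m = 1 the maximum likelihood estimate is adopted as is.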

Therefore, according to the speaker adaptation apparatus of the eighth embodiment, the parameters are linearly interpolated according to the recognition result reliability after maximum likelihood estimation of the standard pattern parameters as described above, so erroneous updating of the standard pattern parameters is prevented even if the recognition result for speaker adaptation learning is incorrect, and the recognition rate can be improved.
[0057]
Embodiment 9.
In the ninth embodiment, in the linear interpolation of the standard pattern parameters in the speaker adaptation apparatus of the eighth embodiment, the weight of the maximum likelihood estimate is increased when the total recognition result reliability of the adaptation data used for the maximum likelihood estimation of that parameter is large. This corresponds to the speaker adaptation apparatus according to claim 9. Expression 9 is an example of the invention according to claim 9 and calculates the value of the weighting coefficient wm(q) in Expression 8. In Expression 9, Sf(t) is the recognition result reliability at frame t, Ω is the set of frame times of the adaptation data used for updating the parameter μ, and τ is a control constant with a value of 0 or more.
(Equation 9)
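One weighting function with the stated properties is sketched below. The form wm = ΣSf(t) / (ΣSf(t) + τ) is an assumption chosen because it stays within 0 to 1, grows with the total reliability over Ω, and uses a non-negative control constant τ; it is not confirmed as the exact form of Expression 9. The function name is hypothetical, and the frame reliabilities are assumed to be non-negative.

```python
def interpolation_weight(frame_reliabilities, tau):
    """Weight of the maximum likelihood estimate from summed frame reliability.

    Assumed form: w_m = total / (total + tau), where total is the sum of
    Sf(t) over the frames used to update the parameter.
    """
    total = sum(frame_reliabilities)
    if total < 0 or tau < 0:
        raise ValueError("total reliability and tau must be non-negative")
    if total + tau == 0:
        return 0.0  # no reliable data at all: keep the original parameter
    return total / (total + tau)


w_m = interpolation_weight([0.5, 0.5, 1.0], tau=2.0)
# w_m == 0.5
```

A larger τ makes the update more conservative, since more reliable adaptation data is then needed before the maximum likelihood estimate dominates.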

Therefore, according to the speaker adaptation apparatus of the ninth embodiment, the configuration described above prevents erroneous learning of the standard pattern parameters even if the recognition result is incorrect, and the recognition rate can be improved.
[0058]
Embodiment 10.
FIG. 9 is a block diagram showing Embodiment 10, one configuration example of the speaker adaptation apparatus according to the tenth aspect of the present invention. In FIG. 9, functional blocks identical to those in the first embodiment carry the same reference numerals, and their description is omitted. The characteristic part of this embodiment is that the unsupervised speaker adaptation means 102 with recognition result reliability is constituted by an adaptive learning means 901 that uses learning data weighted by the recognition result reliability.
[0059]
Next, the operation will be described with reference to FIG. 9. The adaptive learning means 901 using learning data weighted by the recognition result reliability receives the speech recognition result 2006 for speaker adaptation learning, the standard pattern 2004, the recognition result reliability output from the recognition result reliability calculation means 101, and the time series of speech features output from the speech feature extraction means 2002, and updates the parameters by weighting the adaptation data according to the recognition result reliability. For example, in a speaker adaptation apparatus in which the standard pattern 2004 is an HMM, the mean vector of a Gaussian distribution is updated by Expression 10 and its covariance matrix by Expression 11. oh(t) in Expression 10 is a speech feature weighted by the recognition result reliability and is calculated by, for example, Expression 12.
In Expression 12, μ(q) is the mean vector of the Gaussian distribution before updating, o(t) is the speech feature at time t, τ is a control constant with a value of 0 or more, and Sf(t) is the recognition result reliability of frame t. When Sf(t) is small, oh(t) is close to the mean vector before updating and o(t) contributes little to the parameter update; when Sf(t) is large, oh(t) is close to o(t) and its contribution to the parameter update increases. In Expression 10, γ(q, t) is the expected value that the speech feature o(t) belongs to Gaussian distribution q at time t; it may instead be calculated as the expected value for the weighted speech feature oh(t). It is also possible to use the μa(q) obtained here as μm(q) in Expression 8 and linearly interpolate it with the standard pattern parameters before updating.
(Equation 10)

(Equation 11)

(Equation 12)
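The reliability-weighted mean update can be sketched as follows. Two forms are assumed, since only the behavior of the expressions is described: oh(t) = (Sf(t) o(t) + τ μ(q)) / (Sf(t) + τ) for Expression 12, which approaches μ(q) for small Sf(t) and o(t) for large Sf(t), and the occupancy-weighted average μa(q) = Σ γ(q, t) oh(t) / Σ γ(q, t) for Expression 10. The covariance update of Expression 11 is not shown, and all function names and values are hypothetical.

```python
def weighted_feature(o_t, mu, s_t, tau):
    """Pull o(t) toward the old mean mu when the reliability s_t is small.

    Assumed form: oh(t) = (s_t * o(t) + tau * mu) / (s_t + tau).
    """
    denom = s_t + tau
    if denom <= 0:
        return list(mu)  # no usable weight: fall back to the old mean
    return [(s_t * x + tau * m) / denom for x, m in zip(o_t, mu)]


def update_mean(features, gammas, reliabilities, mu, tau):
    """Occupancy-weighted average of the reliability-weighted features."""
    numerator = [0.0] * len(mu)
    denominator = 0.0
    for o_t, gamma, s_t in zip(features, gammas, reliabilities):
        oh_t = weighted_feature(o_t, mu, s_t, tau)
        for p, value in enumerate(oh_t):
            numerator[p] += gamma * value
        denominator += gamma
    if denominator == 0:
        return list(mu)  # the Gaussian saw no data: keep the old mean
    return [n / denominator for n in numerator]


# Frame 1 is reliable (Sf = 1.0); frame 2 is not (Sf = 0.0).
mu_a = update_mean(
    features=[[4.0], [8.0]],
    gammas=[1.0, 1.0],
    reliabilities=[1.0, 0.0],
    mu=[0.0],
    tau=1.0,
)
# mu_a == [1.0]
```

In this example the unreliable frame is replaced by the old mean before averaging, so it does not pull the updated mean toward 8.0.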

Therefore, according to the speaker adaptation apparatus of the tenth embodiment, erroneous learning of the standard pattern parameters is prevented even if the recognition result is incorrect, and the recognition rate can be improved.
[0060]
Embodiment 11.
In the eleventh embodiment, in the speaker adaptation apparatus of the tenth embodiment, the recognition result reliability is given for each frame, takes a value from 0 to 1, and is close to 1 when the reliability is high. This corresponds to the speaker adaptation apparatus according to claim 11.
Therefore, according to the speaker adaptation apparatus of the eleventh embodiment, the configuration described above prevents erroneous learning of the standard pattern parameters even if the recognition result is incorrect, and the recognition rate can be improved.
[0061]
Embodiment 12.
FIG. 10 is a block diagram showing Embodiment 12, one configuration example of the speaker adaptation apparatus according to the twelfth aspect of the present invention. In FIG. 10, functional blocks identical to those in the first embodiment carry the same reference numerals, and their description is omitted. The characteristic part of this embodiment is that the matching means 2003 is constituted by a multiple recognition result candidate output matching means 1001, the recognition result reliability calculating means 101 by a multiple recognition result candidate reliability calculating means 1002, and the unsupervised speaker adaptation means 102 by an unsupervised speaker adaptation means 1003 with multiple recognition result candidate reliability.
[0062]
Next, the operation will be described with reference to FIG. 10. The multiple recognition result candidate output matching means 1001 connects the standard patterns 2004 according to the recognition target words determined by the recognition dictionary 2005, matches them against the speech features output from the speech feature extraction means 2002, and outputs multiple recognition result candidates [Rw ■ (1), Rw ■ (2), ..., Rw ■ (N)] in order starting from the top candidate (Rw ■ (n) is the speech recognition result for speaker adaptation learning with the n-th highest score for the input speech, and N is a predetermined number of candidates).
[0063]
The multiple-recognition-result-candidate reliability calculating means 1002 receives the recognition result candidates [Rw(1), Rw(2), ..., Rw(N)], the speech feature amount, and the standard pattern 2004 as input, and outputs the recognition result reliabilities [Sm(1), Sm(2), ..., Sm(N)]. Here, Sm(n) is the time series of the recognition result reliability of the n-th recognition result candidate for the input speech. If the recognition result reliability is given for each frame as Sf(n, t), then Sm(n) = [Sf(n, 1), Sf(n, 2), ..., Sf(n, Tn)]. The unsupervised speaker adaptation means 1003 with multiple-recognition-result-candidate reliability receives the recognition result candidates output from the matching means 1001, the recognition result reliabilities output from the reliability calculating means 1002, and the standard pattern 2004, updates the parameters of the standard pattern, and outputs the unsupervised speaker adaptation pattern 2008.
[0064]
One method available to the unsupervised speaker adaptation means 1003 with multiple-recognition-result-candidate reliability is to create N unsupervised speaker adaptation patterns independently, one from each recognition result candidate, and then to synthesize the parameters of the N patterns into the final unsupervised speaker adaptation pattern 2008. For example, if the standard pattern is an HMM and the parameters to be updated are the mean vector and covariance matrix of a Gaussian distribution, the mean vector of Gaussian distribution q is calculated by Expression 13 and the covariance matrix by Expression 14. In Expression 13, μi(n, q) is the mean vector of Gaussian distribution q updated using the n-th recognition result candidate, and in Expression 14, Ci(n, q) is the covariance matrix of Gaussian distribution q updated using the n-th recognition result candidate. In Expressions 13 and 14, β(n) is the weight for the n-th recognition result candidate and is calculated by Expression 15. In Expression 15, Si(n) is the recognition result reliability of the n-th recognition result candidate, for example the total of its per-frame recognition result reliabilities.
(Equation 13)

[Equation 14]

(Equation 15)

Therefore, according to the speaker adaptation apparatus of the twelfth embodiment, a plurality of recognition result candidates are output, the recognition result reliability is calculated for each candidate, and unsupervised speaker adaptation with recognition result reliability is performed. Even if the recognition result is incorrect, erroneous learning of the standard pattern parameters can be prevented, and the recognition rate can be improved.
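The combination of the N candidate-specific adaptation results described in this embodiment can be illustrated with a minimal Python sketch. Expressions 13 to 15 are not reproduced in this text, so the sketch assumes that β(n) is each candidate's share of the total reliability, β(n) = Si(n) / Σ Si(m), and that the combined mean vector is the β-weighted sum of the per-candidate mean vectors μi(n, q). The function names and sample values are hypothetical.

```python
from typing import List, Sequence


def candidate_weights(reliabilities: Sequence[float]) -> List[float]:
    """beta(n): assumed to be each candidate's share of the total reliability."""
    total = sum(reliabilities)
    if total <= 0.0:
        # Assumption: with no usable reliability, fall back to equal weights.
        return [1.0 / len(reliabilities)] * len(reliabilities)
    return [s / total for s in reliabilities]


def combine_means(means: Sequence[Sequence[float]],
                  betas: Sequence[float]) -> List[float]:
    """Weighted sum over candidates of the updated mean vectors mu_i(n, q)."""
    dim = len(means[0])
    return [sum(b * m[p] for b, m in zip(betas, means)) for p in range(dim)]


# Hypothetical per-frame reliabilities Sf(n, t) for N = 3 candidates.
frame_reliabilities = [[0.9, 0.8, 0.7], [0.4, 0.2], [0.1, 0.1]]
# Si(n): total of the per-frame reliabilities of candidate n.
si = [sum(frames) for frames in frame_reliabilities]
betas = candidate_weights(si)

# Hypothetical mean vectors of one Gaussian q, updated with each candidate.
means_q = [[1.0, 2.0], [3.0, 0.0], [5.0, -2.0]]
mu_q = combine_means(means_q, betas)
print(betas, mu_q)
```

Under the same assumption, the weighted sum could be applied element by element to the covariance matrices Ci(n, q).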
[0065]
Embodiment 13 FIG.
FIG. 11 is a block diagram showing a thirteenth embodiment which is one configuration example of the speaker adaptation apparatus according to the thirteenth aspect of the present invention. In FIG. 11, the same functional blocks as those in the first embodiment are denoted by the same reference numerals, and description thereof will be omitted. A characteristic part of the present invention is that a recognition result reliability comparing means 1101 is added in front of the unsupervised speaker adaptation means 102 with recognition result reliability.
[0066]
Next, the operation will be described with reference to FIG. 11. The recognition result reliability comparing means 1101 receives the recognition result reliability output from the recognition result reliability calculating means 101. If the recognition result reliability is larger than a predetermined threshold, processing is performed by the unsupervised speaker adaptation means 102 with recognition result reliability. If the recognition result reliability is smaller than the predetermined threshold, the parameters of the standard pattern are not updated, and the values of the standard pattern 2004 are output as the unsupervised speaker adaptation pattern 2008.
[0067]
For example, if the total recognition result reliability of one utterance is equal to or smaller than the threshold Th, the speaker adaptation apparatus does not update the parameters of the standard pattern using that utterance. Alternatively, the apparatus calculates, for each parameter of the standard pattern, the total recognition result reliability of the adaptation data assigned to that parameter by segmentation, and compares this total with the threshold. Parameters whose total is equal to or less than the threshold are not updated, and parameters whose total exceeds the threshold are updated.
Therefore, according to the speaker adaptation apparatus of the thirteenth embodiment, parameters are not updated when the recognition result reliability is equal to or less than the predetermined threshold. Even if the recognition result is incorrect, erroneous learning of the standard pattern parameters can be prevented, and the recognition rate can be improved.
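The two threshold checks described for this embodiment can be sketched as follows. This is only an illustration under assumptions: the per-frame reliabilities, the frame-to-parameter assignment produced by segmentation, and all names are hypothetical.

```python
from typing import Dict, List


def utterance_passes(frame_reliability: List[float], threshold: float) -> bool:
    """Utterance-level check: use the utterance only if its total reliability exceeds Th."""
    return sum(frame_reliability) > threshold


def parameters_to_update(frame_reliability: List[float],
                         frame_to_param: List[str],
                         threshold: float) -> Dict[str, float]:
    """Parameter-level check: total the reliability of the frames assigned to each
    parameter and keep only parameters whose total exceeds the threshold."""
    totals: Dict[str, float] = {}
    for s, param in zip(frame_reliability, frame_to_param):
        totals[param] = totals.get(param, 0.0) + s
    return {p: total for p, total in totals.items() if total > threshold}


# Hypothetical utterance: five frames segmented onto two Gaussian distributions.
reliability = [0.9, 0.8, 0.2, 0.1, 0.7]
assignment = ["g1", "g1", "g2", "g2", "g1"]

use_utterance = utterance_passes(reliability, threshold=2.0)
selected = parameters_to_update(reliability, assignment, threshold=1.0)
print(use_utterance, sorted(selected))
```

Parameters that are not selected simply keep the values of the standard pattern 2004.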
[0068]
Embodiment 14 FIG.
FIG. 12 is a block diagram showing a fourteenth embodiment, which is one configuration example of the speaker adaptation apparatus according to the fourteenth aspect of the present invention. In FIG. 12, functional blocks that are the same as those in the first embodiment are denoted by the same reference numerals, and their description is omitted. The characteristic feature of this embodiment is that a speaker adaptation method selection means 1201 based on recognition result reliability is placed in front of the unsupervised speaker adaptation means 102 with recognition result reliability, and that M unsupervised speaker adaptation means 1202-1 to 1202-M with recognition result reliability are provided.
[0069]
Next, the operation will be described with reference to FIG. 12. The speaker adaptation method selection means 1201 based on recognition result reliability receives the recognition result reliability output from the recognition result reliability calculation means 101 and selects an unsupervised speaker adaptation method using predetermined method selection thresholds [Th(1), Th(2), ..., Th(K)]. For example, the unsupervised speaker adaptation means 1202-k with recognition result reliability is selected when Th(k) ≦ Su < Th(k+1). Here, Su is the total value of the recognition result reliability of one utterance.
[0070]
The unsupervised speaker adaptation means 1202-1 to 1202-M with recognition result reliability can be configured, for example, as follows: 1202-1 is the maximum a posteriori probability estimation method described in "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models", C.-H. Lee, C.-H. Lin, B.-H. Juang, IEEE Transactions on Signal Processing, Vol. 39, No. 4, 1991 (hereinafter referred to as reference 5); 1202-2 is the moving vector field smoothing speaker adaptation method proposed in "Moving vector field smoothing speaker adaptation method using continuous mixture distribution HMM", Okura, Sugiyama, Sagayama, IEICE Technical Report, SP92-16, 1992 (hereinafter referred to as reference 6); and 1202-3 is the speaker adaptation method based on the multiple regression mapping model (Reference 3).
Therefore, according to the speaker adaptation apparatus of the fourteenth embodiment, even if the speech recognition result for speaker adaptation learning is erroneous, this configuration prevents erroneous learning of the standard pattern parameters, and the recognition rate can be improved.
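The interval rule Th(k) ≦ Su < Th(k+1) can be sketched with a binary search over the sorted thresholds. The text does not state what happens when Su is below Th(1) or at or above Th(K), so the sketch assumes that no method is selected in the first case and that the last method is selected in the second. It also assumes one method per threshold. The threshold values and labels are hypothetical.

```python
import bisect
from typing import List, Optional


def select_method(su: float,
                  thresholds: List[float],
                  methods: List[str]) -> Optional[str]:
    """Return the method k for which Th(k) <= Su < Th(k+1).

    thresholds must be sorted in ascending order; methods[k-1] corresponds to Th(k).
    """
    # Number of thresholds that are <= su, which is the 1-based index k.
    k = bisect.bisect_right(thresholds, su)
    if k == 0:
        return None  # Assumption: below Th(1), no adaptation method is applied.
    return methods[k - 1]  # For su >= Th(K) this is the last method (assumption).


thresholds = [10.0, 20.0, 30.0]          # hypothetical Th(1)..Th(3)
methods = ["1202-1", "1202-2", "1202-3"]  # labels of the adaptation means

chosen = [select_method(su, thresholds, methods) for su in (5.0, 15.0, 20.0, 99.0)]
print(chosen)
```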
[0071]
Embodiment 15 FIG.
FIG. 13 is a block diagram showing a fifteenth embodiment, which is one configuration example of the speaker adaptation apparatus according to the fifteenth aspect of the present invention. In FIG. 13, functional blocks that are the same as those in the first embodiment are denoted by the same reference numerals, and their description is omitted. The characteristic feature of this embodiment is that the unsupervised speaker adaptation means with recognition result reliability is configured by the standard pattern parameter grouping means 1301 and the parameter-group unsupervised speaker adaptation means 1302 with recognition result reliability.
[0072]
Next, the operation will be described with reference to FIG. 13. The standard pattern parameter grouping means 1301 groups the standard pattern parameters stored in the standard pattern 2004 by clustering. If the standard pattern is an HMM with Gaussian distributions [g(1), g(2), ..., g(Mg)] (Mg is the total number of Gaussian distributions), then, for example, the distance dv(g(i), g(j)) between Gaussian distributions g(i) and g(j) is defined as the Bhattacharyya distance of Expression 16, clustering is performed, and groups G(x) = [g(x(1)), g(x(2)), ..., g(x(n))] are output, where x(.) is the distribution number. The clustering is performed using, for example, the K-means method described in Document 1.
The parameter-group unsupervised speaker adaptation means 1302 with recognition result reliability receives the standard pattern parameter groups output from the standard pattern parameter grouping means 1301 and the recognition result reliability output from the recognition result reliability calculation means 101, and calculates the variation of the standard pattern parameters for each group. For example, when the standard pattern is an HMM, the movement amount of the p-th dimension of the mean vector is calculated by Expression 17. In Expression 17, α(x) is a weight coefficient determined by the reliability as shown in Expression 18, Ψx is the set of Gaussian distribution numbers in parameter group x, Ωi is the set of adaptation data times for Gaussian distribution number i, and σ2(i, p) is the element in the p-th row and p-th column of the covariance matrix of Gaussian distribution number i. In Expression 18, Sf(t) is the recognition result reliability of frame t, and τ is a control constant having a value of 0 or more.
It is also possible to calculate the common movement amount v(x, p) of the mean vectors of group x using Expression 19. In Expression 19, oh(t) is the adaptation data weighted by the recognition result reliability shown in Expression 12, and γ(i, t) is the expected value that the speech feature o(t) exists in Gaussian distribution i at time t. γ(i, t) may instead be calculated as the expected value that the weighted speech feature oh(t) exists.
(Equation 16)

[Equation 17]

(Equation 18)

[Equation 19]

Therefore, according to the speaker adaptation apparatus of the fifteenth embodiment, even if the recognition result is incorrect, erroneous learning of the standard pattern parameters can be prevented, and the recognition rate can be improved.
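The distance used for grouping the Gaussian distributions can be illustrated with the standard closed form of the Bhattacharyya distance between two Gaussians. Expression 16 is not reproduced in this text, so this is a sketch under assumptions: it uses the textbook formula and, for brevity, diagonal covariance matrices. The function name and sample values are hypothetical.

```python
import math
from typing import Sequence


def bhattacharyya_diag(mu1: Sequence[float], var1: Sequence[float],
                       mu2: Sequence[float], var2: Sequence[float]) -> float:
    """Bhattacharyya distance between two Gaussians with diagonal covariances.

    D = 1/8 (mu1-mu2)^T C^-1 (mu1-mu2) + 1/2 ln(|C| / sqrt(|C1||C2|)),
    where C = (C1 + C2) / 2. With diagonal covariances both terms are sums over dimensions.
    """
    dist = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)                              # averaged variance
        dist += 0.125 * (m1 - m2) ** 2 / v               # mean separation term
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))   # covariance mismatch term
    return dist


d_same = bhattacharyya_diag([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
d_shift = bhattacharyya_diag([0.0], [1.0], [2.0], [1.0])
d_ab = bhattacharyya_diag([0.0, 0.0], [1.0, 4.0], [1.0, 2.0], [2.0, 1.0])
d_ba = bhattacharyya_diag([1.0, 2.0], [2.0, 1.0], [0.0, 0.0], [1.0, 4.0])
print(d_same, d_shift, d_ab, d_ba)
```

A pairwise distance of this kind could then be fed to a clustering procedure such as the K-means method mentioned above.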
[0073]
Embodiment 16 FIG.
FIG. 14 is a block diagram showing a sixteenth embodiment, which is one configuration example of the speaker adaptation apparatus according to the sixteenth aspect of the present invention. In FIG. 14, functional blocks that are the same as those in the first embodiment are denoted by the same reference numerals, and their description is omitted. The characteristic feature of this embodiment is that the unsupervised speaker adaptation means 102 with recognition result reliability comprises a standard pattern parameter tree structure clustering means 1401, a standard pattern parameter grouping means 1402 based on tree structure parameters, and a parameter-group unsupervised speaker adaptation means 1302 with recognition result reliability.
[0074]
Next, the operation will be described with reference to FIG. 14. The standard pattern parameter tree structure clustering means 1401 clusters the standard pattern parameters into a tree structure based on, for example, the Bhattacharyya distance shown in Expression 16. To build the tree, all parameters are first clustered into N parameter groups [G(1,1,1), G(1,1,2), ..., G(1,1,N)], where in G(i,j,k) i is the layer number, j is the parent group number, and k is the group number.
Next, as clustering of the second layer, G(1,m1,n1) is divided into [G(2,n1,1), G(2,n1,2), ..., G(2,n1,Nn1)].
Further, as the third layer, G(2,m2,n2) is divided into [G(3,n2,1), G(3,n2,2), ..., G(3,n2,Nn2)]. In this way, clustering is performed down to a predetermined layer. The standard pattern parameter grouping means 1402 based on tree structure parameters groups the tree-structured parameters output by the standard pattern parameter tree structure clustering means 1401 according to the recognition result reliability output by the recognition result reliability calculation means 101.
[0075]
FIG. 15 is an explanatory diagram of the grouping of the tree-structured parameters based on the recognition result reliability. As node information, the total recognition result reliability of the adaptation data for the parameters belonging to each node and below is calculated. When the recognition result reliability of a child node is smaller than a predetermined threshold th and the recognition result reliability of its parent node is th or more, the parameter group below the parent node is used for estimating the parameters belonging to the child node and below. In FIG. 15, the number in parentheses is the recognition result reliability of the adaptation data for the parameters below the node. For example, if th is 40, the reliability is 20 for Node (3, 1) and 100 for its parent node, Node (2, 1). In this case, a common variation is obtained using the adaptation data, the recognition result reliability, and the standard pattern parameters of the parent node's group, and the parameters below Node (3, 2) are updated. The parameter-group unsupervised speaker adaptation means 1302 with recognition result reliability obtains the variation for each parameter group and updates the parameters as described in the fifteenth embodiment.
Therefore, according to the speaker adaptation apparatus of the sixteenth embodiment, even if the recognition result is erroneous, this configuration prevents erroneous learning of the standard pattern parameters, and the recognition rate can be improved.
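The choice of estimation group in the tree can be sketched as a walk from a node toward the root. The text describes only the case of a child below the threshold with a parent at or above it; continuing upward when the parent is also below the threshold is an assumption of this sketch. The class, the function name, and the reliability of the root node are hypothetical; the values 20, 100, and th = 40 follow the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    reliability: float                 # total reliability of adaptation data at and below this node
    parent: Optional["Node"] = None


def estimation_group(node: Node, th: float) -> Optional[Node]:
    """Return the nearest node (the node itself or an ancestor) whose reliability is >= th.

    The parameters below the returned node form the group used to estimate a
    common variation for the parameters below `node`.
    """
    current: Optional[Node] = node
    while current is not None and current.reliability < th:
        current = current.parent       # back off to the parent group
    return current                     # None if even the root has too little reliable data


root = Node("Node(1,1)", 150.0)              # hypothetical root value
node_2_1 = Node("Node(2,1)", 100.0, parent=root)
node_3_1 = Node("Node(3,1)", 20.0, parent=node_2_1)

group = estimation_group(node_3_1, th=40.0)
print(group.name if group else None)
```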
[0076]
Embodiment 17 FIG.
The speaker adaptation apparatus according to the seventeenth embodiment is characterized in that a continuous mixture distribution type hidden Markov model is used as a standard pattern. The details of the continuous mixture distribution type hidden Markov model are described in Document 1, and thus the description thereof is omitted.
[0077]
Embodiment 18 FIG.
In the speaker adaptation apparatus according to the eighteenth embodiment, the element distribution function constituting the symbol output probability density function of the continuous mixture distribution type hidden Markov model is a Gaussian distribution. The Gaussian distribution function is given by Expression 20. In Expression 20, μ(i) and C(i) are the mean vector and covariance matrix of Gaussian distribution i, d is the number of dimensions of the mean vector, and o is the feature vector.
(Equation 20)
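Expression 20 is not reproduced in this text. For reference, the standard multivariate Gaussian density written with the symbols defined above is shown below; the left-hand notation N(o; μ(i), C(i)) is an assumption and may differ from the notation of the original expression.

```latex
N\bigl(o;\, \mu(i),\, C(i)\bigr)
  = \frac{1}{(2\pi)^{d/2}\, \lvert C(i) \rvert^{1/2}}
    \exp\!\left( -\frac{1}{2}\, \bigl(o - \mu(i)\bigr)^{\mathsf{T}}\, C(i)^{-1}\, \bigl(o - \mu(i)\bigr) \right)
```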

[0078]
Embodiment 19 FIG.
The speaker adaptation apparatus according to the nineteenth embodiment is characterized in that the parameter to be adapted is the mean vector of a Gaussian distribution.
[0079]
Embodiment 20 FIG.
FIG. 16 is a block diagram showing Embodiment 20 which is one configuration example of the speaker adaptation apparatus according to the twentieth aspect. In FIG. 16, the same functional blocks as in the first embodiment and the seventh embodiment are denoted by the same reference numerals, and description thereof will be omitted. A characteristic feature of the present invention is that the unsupervised speaker adaptation means with recognition result reliability 102 comprises a standard pattern parameter updating means with recognition result reliability 702 and a standard pattern parameter interpolation means 1601.
[0080]
Next, the operation will be described with reference to FIG. 16. The standard pattern parameter updating means 702 with recognition result reliability updates the mean vectors of the Gaussian distributions by the parameter updating described in the eighth and tenth embodiments. The standard pattern parameter interpolation means 1601 interpolates, according to Expression 21, the mean vectors of Gaussian distributions for which no adaptation data exists, using the difference vectors between the mean vectors before and after updating of the Gaussian distributions learned by the parameter updating means 702.
(Equation 21)

[0081]
FIG. 17 is a conceptual diagram of the interpolation of Gaussian distribution mean vectors. In FIG. 17, μ(1), μ(2), and μ(3) are mean vectors of Gaussian distributions for which adaptation data exists, and μa(1), μa(2), and μa(3) are the mean vectors after updating by unsupervised speaker adaptation. μ(4) is a mean vector for which no adaptation data exists. For μ(4), interpolation is performed according to Expression 21 using the difference vectors before and after the update of the neighboring mean vectors. In Expression 21, μ(q) and μa(q) are the q-th mean vectors before and after the update, αp,q is a weighting factor, TV(p) is the difference vector (moving vector) between the mean vectors before and after the update, and P is the set of neighboring mean vectors used for interpolation. Further, f is a control constant, dp,q is the Mahalanobis distance, C(q) is the covariance matrix of Gaussian distribution q, and the superscript −1 denotes the inverse matrix.
Therefore, according to the speaker adaptation apparatus of the twentieth embodiment, the mean vector of a Gaussian distribution without adaptation data is interpolated using the difference vectors of the mean vectors of Gaussian distributions with adaptation data. Even if the recognition result is incorrect, erroneous learning of the parameters of the standard pattern can be prevented, and the recognition rate can be improved.
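The interpolation of a mean vector without adaptation data can be illustrated as follows. Expression 21 is not reproduced in this text, so the weighting is an assumption: the sketch takes αp,q proportional to exp(−dp,q / f), normalized over the neighbor set P, with dp,q the Mahalanobis distance computed from a diagonal C(q). All names and values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mahalanobis_diag(x: Sequence[float], y: Sequence[float],
                     var: Sequence[float]) -> float:
    """(x - y)^T C^-1 (x - y) for a diagonal covariance matrix C = diag(var)."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, var))


def interpolate_mean(mu_q: Sequence[float], var_q: Sequence[float],
                     neighbors: Sequence[Tuple[Sequence[float], Sequence[float]]],
                     f: float) -> List[float]:
    """Shift mu_q by a weighted average of the neighbors' moving vectors.

    neighbors holds pairs (mu_p, mua_p): a neighboring mean before and after adaptation.
    """
    weights = [math.exp(-mahalanobis_diag(mu_p, mu_q, var_q) / f)
               for mu_p, _ in neighbors]
    total = sum(weights)
    if total == 0.0:
        return list(mu_q)              # all neighbors too far away: leave the mean unchanged
    shifted = list(mu_q)
    for w, (mu_p, mua_p) in zip(weights, neighbors):
        for k in range(len(shifted)):
            # Moving vector of neighbor p, scaled by its normalized weight.
            shifted[k] += (w / total) * (mua_p[k] - mu_p[k])
    return shifted


# Two equally distant neighbors that both moved by (1, 0).
neighbors = [([0.0, 0.0], [1.0, 0.0]), ([2.0, 0.0], [3.0, 0.0])]
mua_4 = interpolate_mean([1.0, 0.0], [1.0, 1.0], neighbors, f=1.0)
print(mua_4)
```

With this weighting, a larger f makes distant neighbors contribute more evenly, while a smaller f concentrates the interpolation on the nearest adapted distributions.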
[0082]
Embodiment 21 FIG.
FIG. 18 is a block diagram showing Embodiment 21 which is one configuration example of the speaker adapting apparatus according to the twenty-first aspect of the present invention. In FIG. 18, the same functional blocks as those in the first embodiment are denoted by the same reference numerals, and description thereof will be omitted. A characteristic part of the present invention is that the unsupervised speaker adaptation means with recognition result reliability 102 is a speaker adaptation means 1801 based on a multiple regression mapping model with recognition result reliability.
[0083]
Next, the operation will be described with reference to FIG. 18. The speaker adaptation means 1801 based on the multiple regression mapping model with recognition result reliability receives the recognition result reliability output from the recognition result reliability calculation means 101, the speech recognition result 2006 for speaker adaptation learning, and the standard pattern 2004, and updates the mean vectors of the Gaussian distributions by a linear transformation based on the multiple regression mapping model of Expression 1. For A and v in Expression 1, the elements of the p-th row of A and the p-th dimension element of v are obtained by Expression 22 using the adaptation data oh(t) weighted by the recognition result reliability shown in Expression 12. In Expression 22, oh(t, p) is the p-th dimension element of the adaptation data oh(t) weighted by the recognition result reliability, and the other variables are the same as in Expression 2. γ(i, t) is the expected value that the speech feature o(t) exists in Gaussian distribution i at time t, but it may instead be calculated as the expected value that the weighted speech feature oh(t) exists.
(Equation 22)

[0084]
Further, the speaker adaptation means 1801 based on the multiple regression mapping model with recognition result reliability may update the mean vector to μa(q) using Expressions 1 and 2, as in conventional speaker adaptation based on the multiple regression mapping model, and then set this μa(q) as μm(q) in Expression 8 to perform linear interpolation based on the recognition result reliability.
Therefore, according to the speaker adaptation apparatus of the twenty-first embodiment, since the unsupervised speaker adaptation based on the multiple regression mapping model with the recognition result reliability is performed, the recognition result for the speaker adaptation learning is incorrect. In this case, erroneous updating of parameters can be prevented, and the recognition rate is improved.
[0085]
Embodiment 22 FIG.
FIG. 19 is a block diagram showing Embodiment 22, which is an example of the configuration of the speaker adaptation apparatus according to the present invention. In FIG. 19, functional blocks that are the same as those in the first, fifteenth, and eighteenth embodiments are denoted by the same reference numerals, and their description is omitted. The characteristic feature of this embodiment is that the unsupervised speaker adaptation means 102 with recognition result reliability is composed of a Gaussian distribution grouping means 1901 and a speaker adaptation means 1801 based on a multiple regression mapping model with recognition result reliability.
[0086]
Next, the operation will be described with reference to FIG. 19. The Gaussian distribution grouping means 1901 groups the Gaussian distributions of the standard pattern 2004 by clustering. For each group, the speaker adaptation based on the multiple regression mapping model with recognition result reliability described in Embodiment 21 is then performed, using the recognition result reliability of the adaptation data of the Gaussian distributions in that group.
Therefore, according to the speaker adaptation apparatus of the twenty-second embodiment, the unsupervised speaker adaptation is performed based on the multiple regression mapping model with the recognition result reliability by grouping the standard patterns in this way. It is possible to prevent erroneous updating of parameters when the recognition result for adaptive learning is wrong, and the recognition rate is improved.
[0087]
Embodiment 23 FIG.
FIG. 20 is a block diagram showing the configuration of a twenty-third embodiment, which is a speech recognition apparatus according to the twenty-third aspect of the present invention, that is, a speech recognition apparatus using the unsupervised speaker adaptation pattern 2008 updated by the unsupervised speaker adaptation apparatus according to any of the first to twenty-second embodiments. In FIG. 20, components that are the same as those of the speaker adaptation device shown in FIG. 1 and the like are denoted by the same reference numerals, and their description is omitted.
[0088]
The recognition target words [W(1), W(2), ..., W(wn)] set by the recognition dictionary 2005 are converted into recognition unit labels, and the unsupervised speaker adaptation patterns 2008 are concatenated according to the labels to create standard patterns of the recognition target words. The speech feature amount output from the speech feature amount extraction means 2002 is collated against these standard patterns, and a speech recognition result 2101 is output. The input speech 2001 may consist of the same words used for unsupervised adaptation or of different words. The speech recognition result 2101 is output in text notation as Rw = [W(r(1)), W(r(2)), ..., W(r(m))]. Here, r(i) is the recognition dictionary word number of the i-th word in the word sequence of the speech recognition result, and m is the number of words in the recognized word sequence.
Therefore, according to the speech recognition apparatus of the twenty-third embodiment, speech recognition is performed using the unsupervised speaker adaptation pattern 2008 obtained by performing unsupervised speaker adaptation with recognition result reliability. It is possible to prevent erroneous updating of parameters when the recognition result for speaker adaptive learning is wrong, and the recognition rate is improved.
[0096]
According to the next invention, the unsupervised speaker adaptation means with recognition result reliability estimates the parameters of the standard pattern by maximum likelihood estimation using the divided adaptation data for updating the standard pattern parameters of the recognition unit. Based on the total value of the recognition result reliability of the adaptive data used for updating the parameters of the standard pattern of the recognition unit, from the standard pattern parameters to the parameters of the speaker adaptive pattern by linear interpolation of the values of the parameters before and after the maximum likelihood estimation. Since updating is performed, even if the result of speech recognition for adaptive learning is erroneous, erroneous updating of the standard pattern parameters can be prevented, and the recognition rate is improved.
[0098]
According to the next invention, the unsupervised speaker adaptation means with recognition result reliability uses the divided adaptation data for updating the parameters of the standard pattern of each recognition unit, calculates a learning weight for the adaptation data based on the recognition result reliability, and updates the standard pattern parameters to the parameters of the speaker adaptation pattern using the weighted adaptation data. Therefore, even when the speech recognition result for adaptive learning is incorrect, erroneous updating of the standard pattern parameters can be prevented, and the recognition rate improves.
[0102]
Further, according to the next invention, the unsupervised speaker adaptation means with recognition result reliability switches the updating method according to the value of the recognition result reliability of one utterance. Therefore, even when the speech recognition result for adaptive learning is incorrect, erroneous updating of the standard pattern parameters can be prevented, and the recognition rate is improved.
[0103]
Further, according to the next invention, the parameters of the standard pattern are grouped by clustering, and the variation of the parameter common to the group is calculated using the divided adaptive data for updating the parameters in the group and the recognition result reliability. Since the parameter of the group of the standard pattern is updated to the parameter of the group of the speaker adaptive pattern according to the parameter variation, even if the speech recognition result for adaptive learning is incorrect, the parameter of the standard pattern is incorrectly updated. Can be prevented and the recognition rate is improved.
[0104]
According to the next invention, the clustering is tree structure clustering, which clusters the standard pattern parameters into a tree structure. The standard pattern parameters below a node are treated as one group when the recognition result reliability of the divided adaptation data for updating the standard pattern parameters belonging to that node and below is equal to or greater than a threshold. The parameter variation common to the group is calculated using the divided adaptation data for updating the parameters in the group and the recognition result reliability, and the parameters of the group of the standard pattern are updated to the parameters of the group of the speaker adaptation pattern according to that variation. Therefore, even if the speech recognition result for adaptive learning is incorrect, erroneous updating of the standard pattern parameters can be prevented, and the recognition rate can be improved.
[0105]
According to the next invention, a continuous mixture distribution type hidden Markov model is used as the standard pattern and the speaker adaptation pattern. Therefore, even when the speech recognition result for adaptive learning is incorrect, the parameter of the standard pattern is incorrectly updated. Can be prevented and the recognition rate is improved.
[0106]
Further, according to the next invention, the element distribution function constituting the symbol output probability density function of the continuous mixture distribution type hidden Markov model is a Gaussian distribution, so that even if the speech recognition result for adaptive learning is incorrect, the standard pattern Incorrect updating of parameters can be prevented, and the recognition rate is improved.
[0107]
According to the next invention, since the parameter to be updated in the unsupervised speaker adaptation means with recognition result reliability is the average vector of the Gaussian distribution, even if the speech recognition result for adaptive learning is incorrect, the parameter of the standard pattern Erroneous updating can be prevented, and the recognition rate is improved.
[0108]
According to the next invention, the mean vector of a Gaussian distribution for which adaptation data exists is updated using the recognition result reliability, and the mean vector of a Gaussian distribution for which no adaptation data exists is interpolated using the difference vectors between the mean vectors before and after the update of Gaussian distributions for which adaptation data exists. The parameters of the standard pattern are thereby updated to the parameters of the speaker adaptation pattern. Therefore, even when the speech recognition result for speaker adaptation learning is incorrect, erroneous updating of the standard pattern parameters can be prevented, and the recognition rate is improved.
[0109]
Further, according to the next invention, the unsupervised speaker adaptation means with recognition result reliability, by speaker adaptation based on a multiple regression mapping model using the recognition result reliability, averages the Gaussian distribution which is a parameter of the standard pattern. Since the vector is updated to the average vector of the Gaussian distribution of the speaker adaptation pattern, even if the result of speech recognition for adaptive learning is incorrect, erroneous updating of the standard pattern parameters can be prevented, and the recognition rate improves.
[0110]
According to the next invention, in the speaker adaptation based on the multiple regression mapping model, the Gaussian distributions of the standard pattern are grouped by clustering, one regression coefficient is calculated for each group of Gaussian distributions from the adaptation data for updating the Gaussian distributions in the group and the recognition result reliability, and the mean vectors of the standard pattern are updated to the mean vectors of the speaker adaptation pattern using the regression coefficient. Therefore, even if the speech recognition result for adaptive learning is incorrect, erroneous updating of the standard pattern parameters can be prevented, and the recognition rate is improved.
[0111]
According to the next invention, the speech recognition apparatus comprises an unsupervised speaker adaptation pattern updated by the speaker adaptation apparatus according to any one of claims 1 to 22, a speech feature amount extraction means for extracting a speech feature amount from the input speech of the speaker, and a matching means that collates the speech feature amount extracted by the speech feature amount extraction means with the unsupervised speaker adaptation pattern and outputs a recognition result. Therefore, even if the speech recognition result for adaptive learning is incorrect, erroneous updating of the standard pattern parameters can be prevented, and the recognition rate is improved.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a first embodiment of a speaker adaptation apparatus according to the present invention.
FIG. 2 is a block diagram showing a configuration of a speaker adaptation apparatus according to a second embodiment of the present invention.
FIG. 3 is an operation explanatory diagram of Embodiment 3 of the speaker adaptation apparatus according to the present invention.
FIG. 4 is an operation explanatory diagram of a speaker adaptation apparatus according to a fourth embodiment of the present invention.
FIG. 5 is an operation explanatory diagram of Embodiment 5 of the speaker adaptation apparatus according to the present invention.
FIG. 6 is an operation explanatory diagram of Embodiment 6 of the speaker adaptation apparatus according to the present invention.
FIG. 7 is a block diagram showing a configuration of a seventh embodiment of the speaker adaptation apparatus according to the present invention.
FIG. 8 is a block diagram showing a configuration of a speaker adaptation apparatus according to an eighth embodiment of the present invention.
FIG. 9 is a block diagram showing a configuration of a tenth embodiment of the speaker adaptation apparatus according to the present invention.
FIG. 10 is a block diagram showing a configuration of a twelfth embodiment of the speaker adaptation apparatus according to the present invention.
FIG. 11 is a block diagram showing a configuration of a thirteenth embodiment of the speaker adaptation apparatus according to the present invention.
FIG. 12 is a block diagram showing a configuration of a fourteenth embodiment of the speaker adaptation apparatus according to the present invention.
FIG. 13 is a block diagram showing a configuration of a fifteenth embodiment of the speaker adaptation apparatus according to the present invention.
FIG. 14 is a block diagram showing a configuration of a sixteenth embodiment of the speaker adaptation apparatus according to the present invention.
FIG. 15 is an operation explanatory diagram of Embodiment 16 of the speaker adaptation apparatus according to the present invention.
FIG. 16 is a block diagram showing a configuration of a twentieth embodiment of the speaker adaptation apparatus according to the present invention.
FIG. 17 is an operation explanatory diagram of Embodiment 20 of the speaker adaptation apparatus according to the present invention.
FIG. 18 is a block diagram showing a configuration of a twenty-first embodiment of the speaker adaptation apparatus according to the present invention.
FIG. 19 is a block diagram showing a configuration of a speaker adaptation apparatus according to a twenty-second embodiment of the present invention.
FIG. 20 is a block diagram showing a configuration of a speech recognition apparatus according to Embodiment 23 of the present invention.
FIG. 21 is a block diagram illustrating a configuration of a conventional speaker adaptation apparatus.
[Explanation of symbols]
101 Recognition result reliability calculation means
102 Unsupervised speaker adaptation means with recognition result reliability
701 Speech data segmentation means
702 Standard pattern parameter updating means with recognition result reliability
801 Standard pattern parameter maximum likelihood estimation means
802 Parameter linear interpolation means based on recognition result reliability
901 Adaptive learning means using recognition result weighted learning data
1001 Multiple recognition result candidate output collation means
1002 Multiple recognition result candidate reliability calculation means
1003 Unsupervised speaker adaptation means with multiple recognition result candidate reliability
1101 Recognition result reliability comparison means
1201 Speaker adaptation method selection means based on recognition result reliability
1202-1 to 1202-M Unsupervised speaker adaptation means with recognition result reliability 1 to M
1301 Standard pattern parameter clustering means
1302 Unsupervised speaker adaptation means for parameter groups with recognition result reliability
1401 Standard pattern parameter tree structure clustering means
1402 Standard pattern parameter grouping means based on tree-structured parameters
1601 Standard pattern parameter interpolation means
1801 Speaker adaptation based on multiple regression mapping model with recognition result reliability
1901 Gaussian distribution grouping means
2001 Input speech
2002 Speech feature extraction means
2003 Collation means
2004 Standard pattern
2005 Recognition dictionary
2006 Speech recognition result for speaker adaptive learning
2007 Unsupervised speaker adaptation
2008 Unsupervised speaker adaptation pattern
2101 Speech recognition result

Claims

A speaker adaptation device that collates a speech feature extracted from a speaker's input speech with a standard pattern obtained by parameter learning based on speech data of a large number of speakers, outputs a recognition result, and determines, according to the reliability of the recognition result, whether to update the standard pattern to a speaker adaptation pattern adapted to the speaker who uttered the input speech, the speaker adaptation device comprising:
standard pattern parameter maximum likelihood estimation means for calculating an estimated adaptive pattern from the speech feature and the standard pattern by maximum likelihood estimation; and
parameter linear interpolation means for calculating the speaker adaptation pattern by linearly interpolating, according to the reliability, between the values of the parameters constituting the estimated adaptive pattern calculated by the standard pattern parameter maximum likelihood estimation means and the values of the parameters constituting the standard pattern.
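As an illustration of the parameter linear interpolation recited above, the following minimal sketch blends a standard-pattern parameter vector with its maximum likelihood estimate. It assumes that the reliability is normalized to the range 0 to 1 and is used directly as the interpolation weight. The claim does not fix this mapping, and the function name is hypothetical.

```python
def interpolate_by_reliability(standard, estimated, reliability):
    """Blend standard-pattern parameters with their ML estimates.

    reliability = 0 keeps the standard pattern unchanged;
    reliability = 1 adopts the maximum likelihood estimate entirely.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability is assumed to lie in [0, 1]")
    return [
        (1.0 - reliability) * s + reliability * e
        for s, e in zip(standard, estimated)
    ]


# A low-reliability recognition result moves the mean only a quarter of the way.
print(interpolate_by_reliability([0.0, 10.0], [4.0, 20.0], 0.25))  # [1.0, 12.5]
```

With this form, an utterance that was probably misrecognized leaves the standard pattern almost untouched, which is the effect the reliability is meant to achieve.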

A speaker adaptation device that collates a speech feature extracted from a speaker's input speech with a standard pattern obtained by parameter learning based on speech data of a large number of speakers, outputs a recognition result, and determines, according to the reliability of the recognition result, whether to update the standard pattern to a speaker adaptation pattern adapted to the speaker who uttered the input speech, the speaker adaptation device comprising:
adaptive learning means for calculating, based on the reliability, a weight applied in parameter learning to the speaker adaptation data obtained from the input speech, and for updating the parameters constituting the standard pattern to the parameters constituting the speaker adaptation pattern using the weighted speaker adaptation data.
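The weighting of the speaker adaptation data can be sketched as follows. The sketch assumes a MAP-style update in which the standard-pattern mean acts as a prior with strength `tau` and each adaptation frame is counted with a weight derived from the reliability. The update formula, the parameter `tau`, and the function name are assumptions for illustration and are not taken from the claim.

```python
def weighted_mean_update(prior_mean, frames, weights, tau=1.0):
    """Update a mean vector from reliability-weighted adaptation frames.

    prior_mean: mean vector of the standard pattern
    frames:     list of feature vectors assigned to this Gaussian
    weights:    one non-negative weight per frame, derived from the reliability
    tau:        assumed prior strength (larger values trust the standard pattern more)
    """
    total = tau + sum(weights)
    return [
        (tau * prior_mean[d] + sum(w * x[d] for w, x in zip(weights, frames))) / total
        for d in range(len(prior_mean))
    ]
```

A frame whose recognition result has zero reliability receives zero weight and therefore does not move the mean at all.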

A speaker adaptation device that collates a speech feature extracted from a speaker's input speech with a standard pattern obtained by parameter learning based on speech data of a large number of speakers, outputs a recognition result, and determines, according to the reliability of the recognition result, whether to update the standard pattern to a speaker adaptation pattern adapted to the speaker who uttered the input speech, the speaker adaptation device comprising:
speaker adaptation method selection means for selecting a different speaker adaptation learning algorithm based on the values of the reliability of recognition results output in the past.
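The selection of an adaptation method from past reliability values can be sketched as follows. The averaging of past values, the thresholds 0.3 and 0.7, and the three method labels are hypothetical. The claim only states that a different learning algorithm is selected based on the reliability of recognition results output in the past.

```python
def select_adaptation_method(past_reliabilities, thresholds=(0.3, 0.7)):
    """Return the index of the adaptation method to use.

    The index is the number of thresholds that the average past
    reliability reaches, so M - 1 thresholds select among M methods.
    """
    if not past_reliabilities:
        return 0  # no history yet: use the most conservative method
    average = sum(past_reliabilities) / len(past_reliabilities)
    return sum(1 for t in thresholds if average >= t)


# Hypothetical method labels, ordered from conservative to aggressive.
METHODS = ["keep standard pattern", "group-wise regression", "per-Gaussian update"]

print(METHODS[select_adaptation_method([0.9, 0.8])])  # per-Gaussian update
```

Ordering the methods from conservative to aggressive lets a speaker whose utterances have been recognized reliably receive a finer-grained adaptation.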