JP5982297B2

JP5982297B2 - Speech recognition device, acoustic model learning device, method and program thereof

Info

Publication number: JP5982297B2
Application number: JP2013028984A
Authority: JP
Inventors: 久保 陽太郎 (Yotaro Kubo); 中村 篤 (Atsushi Nakamura)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-02-18
Filing date: 2013-02-18
Publication date: 2016-08-31
Anticipated expiration: 2033-02-18
Also published as: JP2014157323A

Description

The present invention relates to a speech recognition technique that uses a neural-network-based acoustic model and to a technique for training that acoustic model.

In the following description, symbols such as "^" and "~" used in the text should properly appear directly above the character that follows them, but because of the limitations of text notation they are written immediately before that character. In the equations these symbols are written in their proper positions. Unless otherwise noted, processing described for an individual element of a vector or matrix is applied to every element of that vector or matrix.

<Speech recognition device 90>
FIG. 1 shows the processing flow of the speech recognition device 90, and FIG. 2 shows its functional block diagram. The speech recognition device 90 mainly consists of a feature extraction unit 91 and a word string search unit 92. The feature extraction unit 91 converts the speech signal data (speech data to be recognized) into a time series of feature vectors x_t, one for each frame t, where a frame is a segment of fixed duration cut out of the speech waveform (s91). The word string search unit 92 uses the acoustic model stored in the acoustic model storage unit 93 and the language model stored in the language model storage unit 94 to compute, for the time series of feature vectors (speech feature vectors) x_t output by the feature extraction unit 91, a score sequence under the acoustic model and a score under the language model. Referring to these scores, the word string search unit 92 then searches for the word string that matches the time series of feature vectors x_t (s92). The speech recognition device 90 outputs the word string found by the word string search unit 92 as the recognition result. The acoustic model and the language model are created in advance from training data or the like. The method of creating the acoustic model is described next.

[Acoustic model]
An acoustic model models the acoustic characteristics of speech; by referring to the speech data to be recognized and the acoustic model, the speech data is converted into symbols such as phonemes and words. The way the acoustic model is created therefore strongly affects the performance of a speech recognition device. The speech recognition device 90 uses the feature extraction unit 91 to convert the speech data into a sequence {x₁, x₂, …, x_t, …} of D-dimensional feature vectors x_t (x_t ∈ R^D, where R is the set of real numbers and t is a frame number or the time corresponding to that frame number). In acoustic models for speech recognition, the relationship between each phoneme and this sequence of feature vectors x_t is usually expressed by a left-to-right hidden Markov model (hereinafter also "HMM").

In these models, the time series of feature vectors x_t is modeled as follows: the sequence of state variables {s₁, s₂, …, s_t, …} evolves according to a first-order Markov chain, and each x_t is sampled (emitted) from a probability distribution that depends on the state variable s_t. The information actually stored in memory as the acoustic model can therefore be divided into two kinds: the state transition probability matrix P and the output distribution function parameters Λ. Here the state transition probability matrix P is assumed to be known, and the configuration for learning the output distribution function parameters Λ is described.

The output distribution is generally represented by a Gaussian mixture distribution or by a neural network (hereinafter also "NN"), and Λ denotes its parameters. An acoustic model whose output distribution is a Gaussian mixture is called a Gaussian mixture acoustic model, and one whose output distribution is an NN is called an NN acoustic model.

[NN acoustic model]
The NN acoustic model defines the probability that the feature vector x_t is emitted when the state variable is s_t, using the NN parameters Λ, as follows.
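The equation that belongs at this point is not reproduced in this text. A reconstruction consistent with the surrounding description, which refers to p(s_t) as the denominator and p(s_t | x_t, Λ) as the numerator, might read as follows. Whether the original states equality or proportionality is an assumption here.

```latex
p(x_t \mid s_t, \Lambda) \;\propto\; \frac{p(s_t \mid x_t, \Lambda)}{p(s_t)}
```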

Here p(s_t) in the denominator is the occurrence probability of the state variable s_t and is computed in advance by counting how often each state variable s occurs in the training data. For example, the occurrence probability p(s) is the ratio of the occurrence count of state variable s to the total occurrence count of all state variables in the training data. When the amount of training data is insufficient, p(s) may instead be assumed to be constant.
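The counting procedure described above can be sketched as follows. This is a minimal illustration, assuming the state variables are integer labels; the function name `state_priors` and the toy label sequences are hypothetical and do not come from the patent.

```python
from collections import Counter

def state_priors(label_sequences, num_states=None):
    """Estimate p(s) as the relative frequency of each state label."""
    counts = Counter(s for seq in label_sequences for s in seq)
    total = sum(counts.values())
    # If the number of states is known, also report states that never occur.
    states = range(num_states) if num_states is not None else sorted(counts)
    return {s: counts.get(s, 0) / total for s in states}

# Two toy utterances labelled with states 0, 1 and 2 (illustrative data).
priors = state_priors([[0, 0, 1, 2], [1, 1, 2, 2]])
```

States that never appear receive probability zero in this sketch, which is one reason a constant value may be preferred when training data is scarce.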

The numerator p(s_t | x_t, Λ) is defined as follows using a kind of NN called a multilayer perceptron (Non-Patent Document 1).

Here i = 1, 2, …, L, where L is the number of layers; H⁽ⁱ⁾ is the number of units in the i-th layer; and h⁽ⁱ⁾_j(x_t; Λ) is a real number representing the state of the j-th unit in the i-th layer when the feature vector x_t is given as input. For convenience, H⁽⁰⁾ is set to D and h⁽⁰⁾_j(x_t; Λ) = x_t,j, that is, the j-th element of the feature vector x_t. In this NN acoustic model, the hyperparameters fixed before training are the number of layers L and the number of units H⁽ⁱ⁾ in each layer. The remaining free variables, namely the connection matrices W⁽ⁱ⁾ and the bias vectors b⁽ⁱ⁾, are hereafter denoted collectively by the NN parameters Λ = {W⁽ⁱ⁾, b⁽ⁱ⁾ | ∀i}.
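As an illustration of how such a multilayer perceptron yields p(s_t | x_t, Λ) and the resulting emission score, a minimal sketch is given here. The defining equations are not reproduced in this text, so the sigmoid hidden units and softmax output layer are assumptions, as are the function names and the toy layer sizes.

```python
import numpy as np

def mlp_posteriors(x, weights, biases):
    """Return p(s | x, Lambda) for one D-dimensional feature vector x."""
    h = np.asarray(x, dtype=float)                # h^(0) = x_t
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))    # hidden layers: sigmoid (assumed)
    a = weights[-1] @ h + biases[-1]              # activations of the top layer L
    a -= a.max()                                  # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()                            # softmax output (assumed)

def scaled_log_likelihoods(x, weights, biases, state_priors):
    """log p(s | x, Lambda) - log p(s): the emission score up to a constant."""
    return np.log(mlp_posteriors(x, weights, biases)) - np.log(state_priors)

rng = np.random.default_rng(0)
D, H1, S = 4, 5, 3                                # H^(0)=D, H^(1), H^(2) (toy sizes)
weights = [rng.normal(size=(H1, D)), rng.normal(size=(S, H1))]
biases = [np.zeros(H1), np.zeros(S)]
x = rng.normal(size=D)
post = mlp_posteriors(x, weights, biases)
scaled = scaled_log_likelihoods(x, weights, biases, np.full(S, 1.0 / S))
```

Each W⁽ⁱ⁾ here has shape (H⁽ⁱ⁾, H⁽ⁱ⁻¹⁾), so that the layer sizes L and H⁽ⁱ⁾ are the only quantities fixed before training.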

[Creating the acoustic model]
The acoustic model is created by a probabilistic and statistical method: the parameters Λ are estimated from a group X = {X⁽¹⁾, X⁽²⁾, …, X⁽ⁿ⁾, …} of sequences X⁽ⁿ⁾ of feature vectors x_t obtained from the given training data (hereinafter also the "training feature sequence group") and a group S = {s⁽¹⁾, s⁽²⁾, …, s⁽ⁿ⁾, …} of sequences s⁽ⁿ⁾ of state variables s_t of the training data (hereinafter also the "training state variable sequence group"). Here n is the utterance index. X⁽ⁿ⁾ is a time series describing the acoustic features of one utterance (for example, one sentence) and is expressed as a time series of speech feature vectors, X⁽ⁿ⁾ = {x₁⁽ⁿ⁾, x₂⁽ⁿ⁾, …, x_t⁽ⁿ⁾, …}. Likewise, s⁽ⁿ⁾ is a label sequence of the same length as X⁽ⁿ⁾ and is expressed as a time series of state variables, s⁽ⁿ⁾ = {s₁⁽ⁿ⁾, s₂⁽ⁿ⁾, …, s_t⁽ⁿ⁾, …}. Label sequences are sometimes treated probabilistically, but here they are treated as known. The present invention itself applies unchanged even when they are given probabilistically.

Given these training data, the optimal acoustic model parameters ^Λ are defined, for example, as the acoustic model parameters Λ that maximize the following measure of fit to the training data, F(Λ, X, S).
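The defining equation is not reproduced in this text. One plausible form, assuming the measure of fit is the sum of frame-level log posterior probabilities of the correct labels (a common choice for NN acoustic models, and not confirmed by the source), would be:

```latex
\hat{\Lambda} = \operatorname*{argmax}_{\Lambda} F(\Lambda, X, S),
\qquad
F(\Lambda, X, S) = \sum_{n} \sum_{t} \log p\!\left(s_t^{(n)} \mid x_t^{(n)}, \Lambda\right)
```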

This optimization can be carried out by the backpropagation method (see Non-Patent Document 1).

[Minimum Error Linear Transformation (hereinafter also "MELT")]
For highly accurate speech recognition, it is desirable to use an acoustic model trained on data recorded from the same speaker as the recognition target and in the same environment as at recognition time (ambient conditions such as noise and reverberation). However, because it is difficult to create an acoustic model for every speaker and environment, recognition is generally performed with an acoustic model trained on data recorded from different speakers or in different environments. Adaptation techniques, which correct a trained acoustic model so that it fits the target speaker and the environment at recognition time, are known as a way to improve recognition accuracy in that situation.

MELT (see Non-Patent Document 2) is known as an adaptation technique for neural network acoustic models. In environment adaptation and/or speaker adaptation of an NN acoustic model using MELT, the connection weight matrix W⁽ⁱ⁾ of a certain layer i is updated as follows using a transformation matrix Γ and the pre-adaptation weight matrix ~W⁽ⁱ⁾.

When the transformation matrix Γ is the identity matrix (Γ = I), this extension coincides with the conventional NN. In other words, the parameters of the acoustic model handled by MELT can be regarded as the conventional NN parameters Λ together with Γ.
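The identity property stated above can be checked with a small sketch. The update equation is not reproduced in this text, so the composition order W⁽ⁱ⁾ = ~W⁽ⁱ⁾Γ used here is an assumption, and `adapt_weights` is a hypothetical helper name.

```python
import numpy as np

def adapt_weights(W_pre, Gamma):
    """Combine the pre-adaptation weights with the transformation matrix.

    Assumed order: Gamma acts on the layer input first, i.e. W = W_pre @ Gamma.
    """
    return W_pre @ Gamma

rng = np.random.default_rng(0)
W_pre = rng.normal(size=(3, 4))              # toy pre-adaptation weight matrix
W_same = adapt_weights(W_pre, np.eye(4))     # Gamma = I leaves the network unchanged
W_scaled = adapt_weights(W_pre, 2.0 * np.eye(4))
```

Only Γ is re-estimated at adaptation time, so the number of adapted parameters is set by the size of Γ rather than by the whole network.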

In the training step of MELT, the NN parameters Λ are estimated as follows using the same training data X, S as for a conventional NN.

At recognition time, the optimal transformation matrix ^Γ is estimated by performing the following optimization using adaptation data ~X, ~S collected in advance from the same recognition environment or the same speaker.

Like NN training, this optimization can be carried out with a steepest-gradient method or the like.

[Non-Patent Document 1] D. E. Rumelhart, G. E. Hinton, R. J. Williams, "Learning Representations by Back-Propagating Errors", Nature, 1986, Vol. 323, pp. 533-536.
[Non-Patent Document 2] J. Trmal, J. Zelinka, L. Müller, "On Speaker Adaptive Training of Artificial Neural Networks", Proc. Interspeech, 2010.

To achieve speaker/environment adaptation with MELT, a sufficient amount of adaptation data ~X, ~S must be collected. Constraining the values that Γ can take has been tried as a way to reduce the amount of adaptation data required, but even then a certain amount of adaptation data ~X, ~S must be accumulated.

In environments where speech recognition is actually used, adaptation data ~X, ~S often cannot be accumulated in advance, so there is strong demand for a method that adapts quickly using only the data of the single utterance about to be recognized. However, no technique is known that achieves such real-time adaptation in a speech recognition device based on an NN acoustic model.

An object of the present invention is to provide a speech recognition technique that uses a plurality of NN acoustic models with different properties to adapt quickly to differences in the speaker's speaking style and in the acoustic environment of use (noise/reverberation).

To solve the above problem, according to a first aspect of the present invention, a speech recognition device includes a storage unit that stores a language model and a plurality of neural network acoustic models that differ for each latent class. The device estimates the weight of each latent class from input speech data and performs speech recognition on the speech data based on the estimated weight of each latent class, the language model, and the plurality of neural network acoustic models that differ for each latent class.

To solve the above problem, according to another aspect of the present invention, an acoustic model learning device includes: an acoustic model storage unit that stores, separately for each latent class, a latent class prior distribution parameter indicating how likely the latent class is to occur, a feature generation distribution parameter that is a parameter of the feature generation distribution, a neural network parameter of a neural network acoustic model, and a latent class distribution indicating how likely the latent class is to occur given the observed training data; a neural network learning unit that updates the neural network parameters using the latent class distribution, a training state variable sequence group that is a group of state variable sequences of the training speech data, and a training feature sequence group that is a group of feature sequences of the training speech data; a feature generation distribution learning unit that updates the feature generation distribution parameters using the latent class distribution and the input training speech features; a latent class prior distribution learning unit that updates the latent class prior distribution parameters using the latent class distribution; and a latent class distribution learning unit that updates the latent class distribution using the neural network parameters, the feature generation distribution parameters, the latent class prior distribution parameters, the input training state sequences, and the training speech features. The processing in the neural network learning unit, the feature generation distribution learning unit, the latent class prior distribution learning unit, and the latent class distribution learning unit is repeated until the updates of the neural network parameters, the feature generation distribution parameters, the latent class prior distribution parameters, and the latent class distribution converge.

To solve the above problem, according to another aspect of the present invention, in a speech recognition method, a language model and a plurality of neural network acoustic models that differ for each latent class are stored in a storage unit. The method estimates the weight of each latent class from input speech data and performs speech recognition on the speech data based on the estimated weight of each latent class, the language model, and the plurality of neural network acoustic models that differ for each latent class.

To solve the above problem, according to another aspect of the present invention, in an acoustic model learning method, an acoustic model storage unit stores, separately for each latent class, a latent class prior distribution parameter indicating how likely the latent class is to occur, a feature generation distribution parameter that is a parameter of the feature generation distribution, a neural network parameter of a neural network acoustic model, and a latent class distribution indicating how likely the latent class is to occur given the observed training data. The method includes: a neural network learning step of updating the neural network parameters using the latent class distribution, a training state variable sequence group that is a group of state variable sequences of the training speech data, and a training feature sequence group that is a group of feature sequences of the training speech data; a feature generation distribution learning step of updating the feature generation distribution parameters using the latent class distribution and the input training speech features; a latent class prior distribution learning step of updating the latent class prior distribution parameters using the latent class distribution; and a latent class distribution learning step of updating the latent class distribution using the neural network parameters, the feature generation distribution parameters, the latent class prior distribution parameters, the input training state sequences, and the training speech features. The processing in the neural network learning step, the feature generation distribution learning step, the latent class prior distribution learning step, and the latent class distribution learning step is repeated until the updates of the neural network parameters, the feature generation distribution parameters, the latent class prior distribution parameters, and the latent class distribution converge.

In a speech recognition device using an NN acoustic model, which is generally said to outperform one using a Gaussian mixture acoustic model, the invention makes possible what was previously impossible: fast adaptation to the speaker/environment, that is, adaptation performed each time an utterance is processed, without accumulating adaptation data.

FIG. 1 is a diagram showing the processing flow of a prior-art speech recognition device.
FIG. 2 is a functional block diagram of a prior-art speech recognition device.
FIG. 3 is a diagram showing the processing flow of the acoustic model learning device according to the first embodiment.
FIG. 4 is a diagram showing a configuration example of the acoustic model learning device according to the first embodiment.
FIG. 5 is a diagram showing the processing flow of the speech recognition device according to the first embodiment.
FIG. 6 is a diagram showing a configuration example of the speech recognition device according to the first embodiment.
FIG. 7 is a diagram showing simulation results of the speech recognition device according to the first embodiment.

Embodiments of the present invention are described below.

<First embodiment>
<Key points of the first embodiment>
[NN acoustic model used in this embodiment]
To incorporate changes in environment and speaker explicitly into the model, this embodiment models the joint probability distribution p(x, s) of the state variable sequence s and the feature sequence x. Here x is the random variable corresponding to the sequence of acoustic features {x₁, x₂, …, x_t, …}, and s is the label sequence corresponding to the sequence of state variables {s₁, s₂, …, s_t, …}. In the conventional NN acoustic model, the joint probability distribution was modeled with a single set of NN parameters Λ as p(x, s | Λ) = p(x | s, Λ) p(s). In this embodiment, taking into account the latent environment/speaker factors underlying an utterance, the joint distribution is instead defined as a mixture model over latent classes k (hereinafter the "latent class model"), as follows.

Here K is the number of latent classes, which is assumed in advance. A latent class is a group with homogeneous tendencies, classified according to latent (not directly observable) environment/speaker factors. With such a latent class model, the joint probability for each latent class can be regarded as the product of the feature generation probability p(x | k), the state variable probability p(s | x, k), and the latent class prior probability p(k). Assuming that the elements of the time series are independent and introducing separate parameters for each distribution function yields the following expression.

Here Λ_k are the NN parameters of latent class k and Θ_k are the parameters of the feature generation distribution of latent class k (hereinafter also "feature generation distribution parameters"). Equation (8) defines the joint probability distribution p(x, s) of the state variable sequence s and the feature sequence x as the sum, over all latent classes, of the product of the probability p(x, s | k, Λ_k, Θ_k) computed from the NN acoustic model of latent class k and the probability p(k) that indicates how likely that latent class is to occur.
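The two equations referred to in the preceding paragraphs are not reproduced in this text. A reconstruction that follows the verbal description (a mixture over K latent classes, with frames treated as independent within a class) might read as follows; the exact layout and numbering of the original equations are assumptions.

```latex
\begin{align}
p(x, s) &= \sum_{k=1}^{K} p(x \mid k)\, p(s \mid x, k)\, p(k) \\
p(x, s) &= \sum_{k=1}^{K} p(k)\, p(x, s \mid k, \Lambda_k, \Theta_k),
\qquad
p(x, s \mid k, \Lambda_k, \Theta_k) = \prod_{t} p(x_t \mid \Theta_k)\, p(s_t \mid x_t, \Lambda_k)
\end{align}
```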

In this latent class model, p(s_t | x_t, Λ_k) is defined with an NN as in Equation (2'), where the NN parameters Λ_k are separate variables for each latent class k.

If the probability p(x_t | Θ_k) is defined with a conventional multivariate continuous distribution function and p(k) is parameterized as a multinomial distribution, that is, so that the probability p(k) = p_k is estimated directly, then the adjustable parameters of this model are the NN parameters Λ_k, the feature generation distribution parameters Θ_k, and p_k. Hereafter p_k is called the latent class prior distribution, or the latent class prior distribution parameter, because it represents the probability of a latent class (how likely each latent class is relative to all latent classes). The "latent class distribution q_n,k" described later represents how likely latent class k is after the corresponding n-th data item has been observed, and is a different distribution from the "latent class prior distribution p_k".

Multivariate continuous distributions that can be used for the probability p(x_t | Θ_k) include the normal distribution given by the following equation and mixtures of normal distributions.

Here Θ_k = {μ_k, V_k}, where μ_k is the mean vector of the feature generation probability p(x | k) for latent class k and V_k is its covariance matrix. The method of estimating the parameters in consideration of the latent class k is described below.
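The normal-distribution equation is not reproduced in this text. Assuming it is the standard D-dimensional multivariate normal density with mean vector μ_k and covariance matrix V_k, as the description of these symbols suggests, it can be written as:

```latex
p(x_t \mid \Theta_k) = \mathcal{N}(x_t;\, \mu_k, V_k)
= \frac{1}{(2\pi)^{D/2}\, |V_k|^{1/2}}
\exp\!\left( -\tfrac{1}{2}\, (x_t - \mu_k)^{\top} V_k^{-1} (x_t - \mu_k) \right)
```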

<Parameter estimation method considering the latent class k>
Because the latent class model defined above is a mixture distribution, all of its parameters (Θ_k, Λ_k, p_k) can be estimated with the existing EM algorithm (see Reference 1).
[Reference 1] A. P. Dempster, N. M. Laird, D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, 1977, Series B, Vol. 39, No. 1, pp. 1-38.

(Mixture models and the EM algorithm)
The following outlines how parameters are estimated with the EM algorithm.

Consider defining a probability distribution p(A | Θ) over a variable A using parameters Θ. The probability distribution of A within each of several latent classes k can be defined with parameters θ_k specific to that class, giving p(A | k, θ_k), and p(A | Θ) can then be defined by marginalizing over k (see Equation (10)).

In this case, the parameters Θ are as follows.

Increasing expressive power in this way, by modeling a complex probability distribution with a number of small classes that can each be modeled easily, is called mixture modeling.

A mixture model is generally trained by maximizing the log-likelihood of the training data. Using training data A⁽¹⁾, A⁽²⁾, …, A⁽ⁱ⁾, …, the optimal parameters can be written as follows.

Mixture models are generally trained with a method called the EM algorithm (see Reference 1). The EM algorithm derives a lower bound as follows, using Jensen's inequality and occupancy parameters q_i,k (Σ_k q_i,k = 1).
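The lower-bound equation is not reproduced in this text. The standard derivation via Jensen's inequality, which agrees with the equality condition q_i,k = p(k | A⁽ⁱ⁾, Θ) stated in the text, is sketched here; whether the original Q(Θ; q_i,k) retains the term in log q_i,k is an assumption.

```latex
\sum_{i} \log p\!\left(A^{(i)} \mid \Theta\right)
= \sum_{i} \log \sum_{k} q_{i,k}\, \frac{p\!\left(A^{(i)}, k \mid \Theta\right)}{q_{i,k}}
\;\ge\; \sum_{i} \sum_{k} q_{i,k} \log \frac{p\!\left(A^{(i)}, k \mid \Theta\right)}{q_{i,k}}
\;=:\; Q(\Theta;\, q_{i,k})
```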

Equality holds in this inequality when q_i,k = p(k | A⁽ⁱ⁾, Θ), so the EM algorithm can be constructed by repeatedly estimating q_i,k from the current parameter estimate Θ' and then estimating the parameters Θ that maximize the lower bound. That is, the EM algorithm is executed by repeating the following two steps.
E-step: From the given parameters Θ', estimate q_i,k = p(k | A⁽ⁱ⁾, Θ').
M-step: Substitute the obtained q_i,k into Equation (12) above, find the Θ that maximizes Q(Θ; q_i,k), and assign it to Θ'.

The M-step of the EM algorithm need not find the exact maximizer of Q(Θ; q_i,k); a numerical solution may be substituted where necessary.
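The two-step iteration described above can be illustrated on a much simpler model than the one in this patent. The sketch assumes a one-dimensional Gaussian mixture over scalar data; the function name `em_gmm_1d`, the quantile-based initialization, and the small variance floor are illustrative choices, not part of the source.

```python
import numpy as np

def em_gmm_1d(a, K=2, iters=50):
    """EM for a 1-D Gaussian mixture: returns weights, means, variances, log-likelihoods."""
    a = np.asarray(a, dtype=float)
    mu = np.quantile(a, (np.arange(K) + 0.5) / K)   # spread initial means over the data
    var = np.full(K, a.var() + 1e-6)
    pk = np.full(K, 1.0 / K)
    loglik = []
    for _ in range(iters):
        # E-step: q[i, k] = p(k | A_i, Theta') computed in the log domain.
        log_joint = (np.log(pk) - 0.5 * np.log(2 * np.pi * var)
                     - 0.5 * (a[:, None] - mu) ** 2 / var)
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        q = np.exp(log_joint - log_norm[:, None])
        loglik.append(log_norm.sum())               # log-likelihood under Theta'
        # M-step: closed-form maximizers of the lower bound for this model.
        nk = q.sum(axis=0)
        pk = nk / len(a)
        mu = (q * a[:, None]).sum(axis=0) / nk
        var = (q * (a[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6  # floor avoids collapse
    return pk, mu, var, loglik

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])
pk, mu, var, loglik = em_gmm_1d(data)
```

For a Gaussian mixture the M-step has a closed form; for models that contain an NN, the M-step would instead be carried out numerically, as the preceding paragraph allows.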

<Parameter estimation in this embodiment>
The EM algorithm requires an estimate of the latent class distribution (q_{i,k} in Equation (12)). In the model of this embodiment, there is one latent class k per utterance n, and the probability distribution of the observations given latent class k is expressed as p(x, s | k, Λ_k, Θ_k) in Equation (8). The estimate q_{n,k} of the latent class distribution is therefore expressed as follows, using the parameter estimates Λ_k', Θ_k', and p_k'.
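Equation (13) is not reproduced in this text. The following is a minimal sketch of how such a latent class posterior could be computed, assuming it is proportional to p_k' multiplied over frames by p(s_t⁽ⁿ⁾ | x_t⁽ⁿ⁾, Λ_k') and p(x_t⁽ⁿ⁾ | Θ_k'), the two probabilities the description says appear in Equation (13). The function name and the per-utterance log-score inputs are hypothetical.

```python
import math


def latent_class_posteriors(log_prior, log_nn, log_gen):
    """Return q[n][k], the posterior weight of latent class k for utterance n.

    Assumed inputs (hypothetical layout, not from the patent):
      log_prior[k]  : log p_k'
      log_nn[n][k]  : sum over frames t of log p(s_t | x_t, Lambda_k')
      log_gen[n][k] : sum over frames t of log p(x_t | Theta_k')
    """
    q = []
    for nn_row, gen_row in zip(log_nn, log_gen):
        scores = [lp + a + b for lp, a, b in zip(log_prior, nn_row, gen_row)]
        top = max(scores)  # subtract the maximum so exp() cannot underflow to all zeros
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        q.append([w / total for w in weights])
    return q
```

Working in the log domain matters here because per-utterance likelihoods are products over many frames and would underflow if multiplied directly.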

In the model of this embodiment, the optimization of the lower bound (Q(Θ; q_{i,k}) in Equation (12)) once the estimate q_{n,k} of the latent class distribution has been fixed can be written as follows.

Because this is an optimization of a sum of terms that share no variables, the optimum is obtained by decomposing the problem and solving it separately for each variable. That is, the optimization of Equation (14) decomposes into the following 3 × K optimization problems.

Here, p(s_t⁽ⁿ⁾ | x_t⁽ⁿ⁾, Λ_k) in Equation (15) is the output probability of the state variable s_t⁽ⁿ⁾ conditioned on the acoustic model Λ_k and the feature vector x_t⁽ⁿ⁾ of the t-th frame of the n-th training data item; it can be regarded as the probability that the acoustic model Λ_k assigns to the correct state s_t⁽ⁿ⁾ given x_t⁽ⁿ⁾. The estimate q_{n,k} of the latent class distribution in Equation (15) can be regarded as the weight of latent class k for the n-th training data item. Equation (15) means that the NN parameter Λ_k is updated so as to maximize the sum, over all training data and all latent classes, of the product of the correct-answer probability of the state variable sequence for each training data item and the weight q_{n,k} of latent class k for that item.
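The weighted objective described for Equation (15) can be sketched as follows for a single latent class k. This assumes the objective takes the log form that follows from the EM lower bound (a q-weighted sum of frame-level log probabilities); the equation itself is not reproduced in this text, and the function and argument names are hypothetical.

```python
import math


def weighted_nn_objective(q_k, frame_probs):
    """Weighted log-probability objective for one latent class k.

    q_k[n]            : weight q_{n,k} of class k for utterance n
    frame_probs[n][t] : p(s_t | x_t, Lambda_k) for the reference state at frame t
    """
    total = 0.0
    for weight, probs in zip(q_k, frame_probs):
        # Each utterance contributes its frame log probabilities scaled by its class weight.
        total += weight * sum(math.log(p) for p in probs)
    return total
```

An utterance with q_{n,k} = 0 contributes nothing to the objective for class k, which is how each network comes to specialize on the utterances assigned to its class.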

Equation (16) means that the generative model parameter Θ_k is updated so as to maximize the sum, over all training data and all latent classes, of the product of the probability that the feature vector sequence {x₁⁽ⁿ⁾, x₂⁽ⁿ⁾, …, x_t⁽ⁿ⁾, …} is sampled (output) from the probability distribution given by the feature generation distribution parameter Θ_k and the weight q_{n,k} of latent class k for that training data item.

Equation (17) means that the latent class prior p_k is optimized so as to be closest to the estimated latent class distribution q_{n,k}.

The optimization with respect to the latent class prior p_k (Equation (17)) is known to have a closed-form optimal solution. That is, the latent class prior p_k can be obtained as the ratio of each latent class's latent class distribution to the sum of the latent class distributions over all latent classes.

For the optimization of the feature generation distribution parameter Θ_k (Equation (16)), it is likewise known that the optimal solution can be derived analytically when the distribution is a simple one such as a normal distribution.
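As an illustration of such an analytic solution, the sketch below computes the weighted maximum-likelihood mean and variance of a Gaussian for one latent class. It assumes a diagonal covariance for brevity (the simulation reported later uses a multivariate Gaussian, whose exact parameterization is not given here) and assumes the weights sum to a positive value. All names are hypothetical.

```python
def weighted_gaussian_update(q_k, utterances):
    """Closed-form weighted ML estimate of a diagonal Gaussian for one class.

    q_k[n]        : weight q_{n,k} of the class for utterance n
    utterances[n] : list of frames, each frame a list of floats
    Returns (mean, variance), one value per feature dimension.
    """
    dim = len(utterances[0][0])
    total = 0.0
    first = [0.0] * dim   # weighted sum of x
    second = [0.0] * dim  # weighted sum of x squared
    for weight, frames in zip(q_k, utterances):
        for frame in frames:
            total += weight
            for d in range(dim):
                first[d] += weight * frame[d]
                second[d] += weight * frame[d] * frame[d]
    mean = [s / total for s in first]
    variance = [second[d] / total - mean[d] ** 2 for d in range(dim)]
    return mean, variance
```

Every frame of an utterance shares that utterance's weight, reflecting the one-latent-class-per-utterance structure of the model.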

However, for the optimization of the NN parameter Λ_k (Equation (15)), the nature of NNs means that the optimal solution cannot be derived analytically except in a few special cases. The NN parameter Λ_k must therefore be optimized numerically using backpropagation (see Non-Patent Document 1).

Based on the above, the EM algorithm is concretely constructed as follows. Using the training feature sequence group X = {X⁽¹⁾, X⁽²⁾, …, X⁽ⁿ⁾, …}, the training state variable sequence group S = {s⁽¹⁾, s⁽²⁾, …, s⁽ⁿ⁾, …}, and the current parameter estimates (Λ_k', Θ_k', and p_k' for each latent class k), the optimization is performed by repeating the following steps.
E-step (1): For all k, n, and t, compute p(x_t⁽ⁿ⁾ | Θ_k') according to Equation (9').

E-step (2): For all k, n, and t, compute p(s_t⁽ⁿ⁾ | x_t⁽ⁿ⁾, Λ_k') according to Equation (2').

E-step (3): For all k and n, compute the latent class distribution q_{n,k} according to Equation (13).

M-step (1): Using the obtained latent class distribution q_{n,k}, carry out the optimization of Equation (15) by backpropagation (see Non-Patent Document 1).

M-step (2): Using the obtained latent class distribution q_{n,k}, carry out the optimization of Equation (16) by a method suited to the feature generation distribution.

M-step (3): Using the obtained latent class distribution q_{n,k}, update the latent class prior p_k in closed form (Equation (17')).
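The prior update can be sketched directly from the verbal description given earlier: p_k is the share of the total latent class distribution mass that falls on class k. The function name is hypothetical, and the exact typeset form of the update equation is not reproduced in this text.

```python
def update_class_prior(q):
    """Return p[k] = sum_n q[n][k] / sum_n sum_k' q[n][k'].

    q[n][k] : latent class distribution for utterance n and class k
    """
    num_classes = len(q[0])
    per_class = [sum(row[k] for row in q) for k in range(num_classes)]
    norm = sum(per_class)
    return [mass / norm for mass in per_class]
```

When every row of q already sums to 1, the denominator is simply the number of utterances, so p_k is the average of q_{n,k} over n.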

This series of steps can be accelerated by introducing the Viterbi approximation described below into the latent class distribution q_{n,k}.

In the Viterbi approximation, the latent class distribution q_{n,k} is computed with the following approximation instead of Equation (13).

Here δ_{i,j} is the Kronecker delta, which is 1 only when i = j and 0 otherwise. With this approximation, many elements of the latent class distribution q_{n,k} become zero, so the effective computation time can be reduced substantially.
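The approximation formula itself is not reproduced in this text. A minimal sketch, assuming it assigns each utterance entirely to its highest-scoring latent class (a hard assignment expressed with the Kronecker delta), might look like this. The function name and the score layout are hypothetical.

```python
def viterbi_latent_class(scores):
    """Hard-assign each utterance to its best latent class.

    scores[n][k] : unnormalized log posterior of class k for utterance n
    Returns q with q[n][k] = 1.0 for the best class and 0.0 elsewhere.
    """
    q = []
    for row in scores:
        best = max(range(len(row)), key=lambda k: row[k])
        q.append([1.0 if k == best else 0.0 for k in range(len(row))])
    return q
```

With one-hot weights, each M-step for class k only needs to visit the utterances assigned to k, which is where the time saving comes from.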

<Acoustic model learning device 100>
FIG. 3 shows the processing flow of the acoustic model learning device 100 according to the first embodiment, and FIG. 4 shows an example of its configuration.

The acoustic model learning device 100 includes an acoustic model storage unit 110, an acoustic model learning unit 120, a latent class distribution learning unit 130, and an iteration control unit 140. It receives the training feature sequence group X and the training state variable sequence group S and uses these data to train and output K NN acoustic models, one for each latent class k.

<Acoustic model storage unit 110>
The acoustic model storage unit 110 stores, as the NN acoustic models, the NN parameter Λ_k, the feature generation distribution parameter Θ_k, the latent class prior distribution parameter p_k, and the latent class distribution q_{n,k} for each latent class k.

Before training the parameters, the acoustic model learning device 100 initializes the NN parameters Λ_k and the latent class distribution q_{n,k} (s101) and stores the initial values in the acoustic model storage unit 110.

The latent class distribution q_{n,k} is initialized with random numbers. Any random numbers satisfying Σ_k q_{n,k} = 1 and q_{n,k} > 0 may be used. For example, q_{n,k} may be initialized as q_{n,k} = δ_{n,rn} (where the subscript rn denotes r_n), using r_n drawn uniformly at random from the positive integers {1, …, K}. Alternatively, the initialization of the latent class distribution may be omitted: suitable values are set for q_{n,k} in advance, stored in the acoustic model storage unit 110, and used as the initial values. Setting different initial values for the latent class distribution q_{n,k} in this way makes it possible to train, from the same training data, NN acoustic models whose properties differ for each latent class k.
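The random one-hot initialization in the example can be sketched as below. The code reads the delta as selecting the single class k equal to r_n for utterance n, which is an interpretation, and it uses 0-based class indices in place of {1, …, K}. The function name and the seed parameter are hypothetical.

```python
import random


def init_latent_class_distribution(num_utterances, num_classes, seed=None):
    """Return q[n][k] with exactly one randomly chosen class set to 1.0 per utterance."""
    rng = random.Random(seed)  # private generator so the global random state is untouched
    q = []
    for _ in range(num_utterances):
        r_n = rng.randrange(num_classes)  # 0-based stand-in for r_n in {1, ..., K}
        q.append([1.0 if k == r_n else 0.0 for k in range(num_classes)])
    return q
```

Because the assignment is random, different utterances start in different classes, which breaks the symmetry between the K otherwise identical models.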

The NN parameters Λ_k may be initialized with any values. For example, the NN parameter Λ trained on all the training data by the conventional NN acoustic model creation method described above may be used as Λ_k for every k. This training can be carried out by having the NN learning unit 121, described later, optimize Equation (15) with q_{n,k} = 1. The result is stored in the acoustic model storage unit 110.

<Acoustic model learning unit 120>
The acoustic model learning unit 120 includes an NN learning unit 121, a feature generation distribution learning unit 122, and a latent class prior distribution learning unit 123. From the training state variable sequence group S = {s⁽¹⁾, s⁽²⁾, …, s⁽ⁿ⁾, …} and the training feature sequence group X = {X⁽¹⁾, X⁽²⁾, …, X⁽ⁿ⁾, …}, it learns the NN parameter Λ_k, the feature generation distribution parameter Θ_k, and the latent class prior distribution parameter p_k. The NN learning unit 121, the feature generation distribution learning unit 122, and the latent class prior distribution learning unit 123 may perform their processing in any order.

(NN learning unit 121)
The NN learning unit 121 learns the NN parameter Λ_k by Equation (15) (s102), using the latent class distribution q_{n,k} read from the acoustic model storage unit 110 and the input training state variable sequence group S and training feature sequence group X, and updates the NN parameter Λ_k stored in the acoustic model storage unit 110.

This corresponds, for example, to executing M-step (1) described above. The iterative backpropagation in M-step (1) ends the update once it has been repeated a predetermined number of times.

(Feature generation distribution learning unit 122)
The feature generation distribution learning unit 122 learns the feature generation distribution parameter Θ'_k that satisfies Equation (16) (s103), using the latent class distribution q_{n,k} read from the acoustic model storage unit 110 and the input training feature sequence group X, and updates the feature generation distribution parameter Θ_k stored in the acoustic model storage unit 110.

(Latent class prior distribution learning unit 123)
The latent class prior distribution learning unit 123 learns the latent class prior distribution parameter p_k by Equation (17') (s104), using the latent class distribution q_{n,k} read from the acoustic model storage unit 110, and updates the latent class prior distribution parameter p_k stored in the acoustic model storage unit 110.

<Latent class distribution learning unit 130>
The latent class distribution learning unit 130 learns the latent class distribution q_{n,k} by Equation (13) (s105), using the parameters updated by the acoustic model learning unit 120 (the NN parameter Λ_k, the feature generation distribution parameter Θ_k, and the latent class prior distribution parameter p_k) together with the input training state variable sequence group S and training feature sequence group X, and updates the latent class distribution q_{n,k} stored in the acoustic model storage unit 110.

The probabilities p(s_t⁽ⁿ⁾ | x_t⁽ⁿ⁾, Λ_k') and p(x_t⁽ⁿ⁾ | Θ_k') that appear in Equation (13) may be reused from the values computed by the NN learning unit 121 and the feature generation distribution learning unit 122.

<Iteration control unit 140>
The iteration control unit 140 determines whether the update process has converged (s106) and, if so, ends the update process for the training state variable sequence group S and the training feature sequence group X. For example, it may measure the execution time and judge that convergence has been reached when a predetermined time has elapsed, or it may count the number of updates performed by the acoustic model learning unit 120 or the latent class distribution learning unit 130 and judge that convergence has been reached at a predetermined count. If convergence is not determined, it outputs a control signal to the acoustic model learning unit 120 and the latent class distribution learning unit 130 so that they repeat their processing.
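The stopping rules described for the iteration control unit (an update count or an elapsed-time budget) can be sketched as a simple driver loop. This is an illustration only: `run_em`, its parameters, and the two callables are hypothetical stand-ins for the processing of units 120 and 130, ordered as in steps s102 to s105 (parameter updates first, then the latent class distribution update).

```python
import time


def run_em(e_step, m_step, max_iterations=10, time_budget_sec=None):
    """Alternate m_step() and e_step() until a count or time limit is reached.

    Returns the number of completed iterations.
    """
    start = time.monotonic()
    iterations = 0
    while iterations < max_iterations:
        # Treat an exhausted time budget as "converged", as the description allows.
        if time_budget_sec is not None and time.monotonic() - start >= time_budget_sec:
            break
        m_step()  # stands in for units 121-123 (s102-s104)
        e_step()  # stands in for unit 130 (s105)
        iterations += 1
    return iterations
```

Neither rule measures actual convergence of the objective; both are budget-based proxies, which keeps the control logic independent of the model.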

<Effect>
With this configuration, multiple NN acoustic models with different properties can be trained from the same training data.

<Principle of speech recognition>
The speech recognition device 200 differs from a conventional speech recognition device in that it holds multiple NN acoustic models, one for each latent class k. These NN acoustic models can be built by the acoustic model learning device 100. The speech recognition device 200 performs speech recognition by selectively using the NN acoustic models for the different latent classes k.

For the case where multiple Gaussian mixture acoustic models are available, several methods have been proposed for improving recognition accuracy by using them selectively: (1) system combination (see Reference 2), (2) re-estimation of the latent class prior p_k, and (3) model selection.
[Reference 2] G. Evermann, P. Woodland, "Posterior probability decoding, confidence estimation and system combination", Proc. NIST Speech Transcription Workshop, 2000

However, no technique has been known for selectively using multiple NN acoustic models.

The following describes an embodiment in which the "model selection" method is applied to multiple NN acoustic models. The defining feature of this embodiment, however, is that speech recognition is performed using multiple acoustic models with different latent classes; the specific speech recognition processing may be realized in any way. For example, an embodiment that applies system combination or re-estimation of the latent class distribution is also possible.

In other words, the speech recognition device 200 estimates the weight of each latent class from the input speech data and performs speech recognition on the speech data based on the estimated weight of each latent class, the language model, and the multiple NN acoustic models that differ for each latent class. Various methods are conceivable for estimating the weights and for processing the NN acoustic models using them.

The model selection method first estimates the latent class k considered to correspond to the input speech data and then performs speech recognition using only the acoustic model for that latent class. This can be viewed as estimating a weight of 1 for the selected NN acoustic model and a weight of 0 for the other NN acoustic models. Note that the posterior distribution of the latent class for a feature sequence x not contained in the training data must be expressed as a sum (Σ_{s'}) over all possible state variable sequences s', as follows.

In general this sum cannot be computed easily, so this embodiment uses a 1-best approximation. In the 1-best approximation, the maximum-likelihood optimal state variable sequence ^s_k is first computed, one for each latent class k, as follows.

Using the optimal state variable sequences ^s_k obtained in this way, the posterior distribution of the latent class is approximated as follows.

Here, α and β are adjustment terms (scale factors) for model combination, and their values are set in advance.

With this approximation, the latent class ^k corresponding to the input speech X is obtained as follows.

Conventional speech recognition processing is then executed using the acoustic model parameter Λ_{^k} corresponding to the latent class ^k. Estimating the latent class ^k for each utterance makes it possible to perform recognition that takes into account a latent class suited to the environment and speaker characteristics of that utterance.
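Equations (21) and (22) are not reproduced in this text, so the following is only a sketch of the selection step under stated assumptions: the score of class k is taken to be log p_k plus α times the NN log score of the 1-best sequence ^s_k plus β times the generative log score of the features. Which term each of α and β actually scales in the original is not recoverable from this text; the assignment used here, like every name in the code, is hypothetical.

```python
def select_latent_class(log_prior, log_nn, log_gen, alpha=1.0, beta=1.0):
    """Return the index of the latent class with the highest combined score.

    log_prior[k] : log p_k
    log_nn[k]    : log p(^s_k | X, Lambda_k) for the 1-best state sequence (assumed)
    log_gen[k]   : log p(X | Theta_k) (assumed)
    alpha, beta  : scale factors; their pairing with the two terms is an assumption
    """
    scores = [
        lp + alpha * nn + beta * gen
        for lp, nn, gen in zip(log_prior, log_nn, log_gen)
    ]
    return max(range(len(scores)), key=lambda k: scores[k])
```

Setting one scale factor to zero removes the corresponding term from the decision, which mirrors the (1.0, 0.0) and (0.0, 1.0) settings tried in the simulation.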

<Speech recognition device 200>
An embodiment of the speech recognition device 200 configured on the basis of the above theory is described below. FIG. 5 shows the processing flow of the speech recognition device 200 according to this embodiment, and FIG. 6 shows an example of its configuration.

The speech recognition device 200 includes a feature extraction unit 210, an optimal state variable sequence estimation unit 220, an optimal latent class estimation unit 230, a word string search unit 240, an acoustic model storage unit 250, and a language model storage unit 260.

The speech recognition device 200 receives speech data for recognition, performs speech recognition on the speech data, and outputs the resulting word string.

<Acoustic model storage unit 250 and language model storage unit 260>
The acoustic model storage unit 250 stores the K NN acoustic models (Λ_k, Θ_k, p_k, q_{n,k}) for the different latent classes k trained by the acoustic model learning device 100. The language model storage unit 260 stores a language model. A language model based on existing technology may be used.

<Feature extraction unit 210>
The feature extraction unit 210 receives the input speech data, extracts feature vectors x_t from the speech data (s201), and outputs them. Existing techniques can be used for feature extraction.

<Optimal state variable sequence estimation unit 220>
The optimal state variable sequence estimation unit 220 computes Equation (20a) to estimate K optimal state variable sequences (^s_k) (s202) and passes them to the optimal latent class estimation unit 230. For this it uses a part of the group X of sequences X⁽ⁿ⁾ of the feature vectors x_t extracted by the feature extraction unit 210 (hereinafter also called the "recognition feature sequence group"), for example the sequence X⁽ⁿ⁾ of feature vectors x_t⁽ⁿ⁾ for one utterance, together with the latent class prior distribution parameter p_k, the NN parameter Λ_k, and the feature generation distribution parameter Θ_k stored for each latent class in the acoustic model storage unit 250.

<Optimal latent class estimation unit 230>
The optimal latent class estimation unit 230 selects the optimal latent class ^k for a part of the recognition feature sequence group X by Equations (21) and (22) (s203) and outputs it to the word string search unit 240. For this it uses that part of the recognition feature sequence group X (for example X⁽ⁿ⁾), the optimal state variable sequences ^s_k estimated by the optimal state variable sequence estimation unit 220, and the latent class prior distribution parameter p_k, the NN parameter Λ_k, and the feature generation distribution parameter Θ_k stored in the acoustic model storage unit 250.

<Word string search unit 240>
As in a conventional speech recognizer, the word string search unit 240 uses the language model stored in the language model storage unit 260, the acoustic model stored in the acoustic model storage unit 250, and the recognition feature sequence group X to search for a word string that matches the speech data for recognition (more precisely, the recognition feature sequence group X) (s204), and outputs the word string found. Unlike the word string search unit 92 of the conventional speech recognition device 90, however, it uses as the acoustic model the NN parameter Λ_{^k} corresponding to the optimal latent class ^k selected by the optimal latent class estimation unit 230 from among the multiple NN acoustic models stored in the acoustic model storage unit 250. Estimating the optimal latent class ^k for each utterance n and switching the NN parameter Λ_{^k} accordingly enables fast adaptation to the speaker and environment.

<Simulation results>
FIG. 7 shows simulation results for speech recognition by the speech recognition device 200 according to the first embodiment. The TIMIT corpus was used for the simulation. The training set and the evaluation set contain 3,696 and 392 utterances, respectively. The number of latent classes K was 2, and the number of NN hidden units H⁽ⁱ⁾ was 1024. A multivariate Gaussian distribution was used as the feature generation distribution. In training, q_{n,k} was initialized as q_{n,k} = δ_{n,rn} using r_n drawn uniformly at random from the positive integers {1, …, K}. Several values were tried for the scale factors α and β used in model combination (see Equation (21)): (1) α = 1.0, β = 1.0; (2) α = 1.0, β = 0.0; (3) α = 0.0, β = 1.0.

As FIG. 7 shows, it was confirmed that the speech recognition device 200 according to the first embodiment improves accuracy by selectively using the NN acoustic models. Regarding how the latent class is chosen, it was also confirmed that sufficient improvement is obtained even with settings other than α = 1.0, β = 1.0, which is considered to be the most exact approximation.

<Effect>
In a speech recognition device using an NN acoustic model, which is generally said to outperform a speech recognition device using a Gaussian mixture acoustic model, this makes possible fast adaptation to the speaker and environment (adaptation performed in real time without accumulating adaptation data), which was previously not possible.

<Other modifications>
The present invention is not limited to the above embodiment and modifications. For example, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capacity of the device executing them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

＜プログラム及び記録媒体＞
また、上記の実施形態及び変形例で説明した各装置における各種の処理機能をコンピュータによって実現してもよい。その場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、上記各装置における各種の処理機能がコンピュータ上で実現される。 <Program and recording medium>
In addition, various processing functions in each device described in the above embodiments and modifications may be realized by a computer. In that case, the processing contents of the functions that each device should have are described by a program. Then, by executing this program on a computer, various processing functions in each of the above devices are realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。 The program describing the processing contents can be recorded on a computer-readable recording medium. As the computer-readable recording medium, for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.

また、このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させてもよい。 The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Further, the program may be distributed by storing the program in a storage device of the server computer and transferring the program from the server computer to another computer via a network.

このようなプログラムを実行するコンピュータは、例えば、まず、可搬型記録媒体に記録されたプログラムもしくはサーバコンピュータから転送されたプログラムを、一旦、自己の記憶部に格納する。そして、処理の実行時、このコンピュータは、自己の記憶部に格納されたプログラムを読み取り、読み取ったプログラムに従った処理を実行する。また、このプログラムの別の実施形態として、コンピュータが可搬型記録媒体から直接プログラムを読み取り、そのプログラムに従った処理を実行することとしてもよい。さらに、このコンピュータにサーバコンピュータからプログラムが転送されるたびに、逐次、受け取ったプログラムに従った処理を実行することとしてもよい。また、サーバコンピュータから、このコンピュータへのプログラムの転送は行わず、その実行指示と結果取得のみによって処理機能を実現する、いわゆるＡＳＰ（Application Service Provider）型のサービスによって、上述の処理を実行する構成としてもよい。なお、プログラムには、電子計算機による処理の用に供する情報であってプログラムに準ずるもの（コンピュータに対する直接の指令ではないがコンピュータの処理を規定する性質を有するデータ等）を含むものとする。 A computer that executes such a program first stores, for example, a program recorded on a portable recording medium or a program transferred from a server computer in its storage unit. When executing the process, this computer reads the program stored in its own storage unit and executes the process according to the read program. As another embodiment of this program, a computer may read a program directly from a portable recording medium and execute processing according to the program. Further, each time a program is transferred from the server computer to the computer, processing according to the received program may be executed sequentially. Also, the program is not transferred from the server computer to the computer, and the above-described processing is executed by a so-called ASP (Application Service Provider) type service that realizes the processing function only by the execution instruction and result acquisition. It is good. Note that the program includes information provided for processing by the electronic computer and equivalent to the program (data that is not a direct command to the computer but has a property that defines the processing of the computer).

In the above, each device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.

DESCRIPTION OF SYMBOLS
100 Acoustic model learning device
110 Acoustic model storage unit
120 Acoustic model learning unit
121 Learning unit
122 Feature quantity generation distribution learning unit
123 Latent class prior distribution learning unit
130 Latent class distribution learning unit
140 Iteration control unit
200 Speech recognition device
210 Feature quantity extraction unit
220 Optimal state variable sequence estimation unit
230 Optimal latent class estimation unit
240 Word string search unit
250 Acoustic model storage unit
260 Language model storage unit

Claims

A speech recognition device comprising a storage unit that stores a language model and a plurality of neural network acoustic models that differ for each latent class,
the device estimating a weight for each latent class from input speech data and performing speech recognition on the speech data based on the language model and the plurality of neural network acoustic models that differ for each latent class, wherein
the storage unit stores, as the neural network acoustic models, for each latent class, a latent class prior distribution parameter indicating the likelihood of occurrence of the latent class, a feature quantity generation distribution parameter that is a parameter of a feature quantity generation distribution, and a neural network parameter of the neural network acoustic model,
the device further comprising:
an optimal state variable sequence estimation unit that estimates, for each latent class, an optimal state variable sequence for the speech data using the latent class prior distribution parameter, the neural network parameter, and the feature quantity generation distribution parameter;
an optimal latent class estimation unit that selects an optimal latent class for the speech data using the latent class prior distribution parameter, the neural network parameter, the feature quantity generation distribution parameter, and the optimal state variable sequence; and
a word string search unit that searches for a word string for the speech data using the neural network parameter corresponding to the optimal latent class and the language model.
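The selection flow recited in this claim can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the patented implementation: the function names, the array layout, and the frame-wise argmax used in place of a full state sequence search are all hypothetical, and the per-class scores are assumed to be available as log values.

```python
import numpy as np

def best_state_sequence(log_post):
    """Pick the most likely state per frame and return (states, total log score).

    Simplification: no state transition model is used, so the "optimal"
    sequence here is just the frame-wise argmax.
    """
    states = log_post.argmax(axis=1)
    score = log_post.max(axis=1).sum()
    return states, score

def select_latent_class(log_prior, log_feat_lik, log_posts):
    """Return (index of the best latent class, its state sequence).

    log_prior:    (K,)       log latent class prior
    log_feat_lik: (K, T)     log feature generation likelihood per class and frame
    log_posts:    (K, T, S)  log state posteriors from each class's network
    """
    best_k, best_score, best_states = -1, -np.inf, None
    for k in range(len(log_prior)):
        states, state_score = best_state_sequence(log_posts[k])
        total = log_prior[k] + log_feat_lik[k].sum() + state_score
        if total > best_score:
            best_k, best_score, best_states = k, total, states
    return best_k, best_states

# Toy input: K=2 latent classes, T=2 frames, S=2 states (all values hypothetical).
log_prior = np.log(np.array([0.5, 0.5]))
log_feat_lik = np.array([[-1.0, -1.0], [-3.0, -3.0]])
log_posts = np.log(np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.5, 0.5]],
]))
k_hat, states_hat = select_latent_class(log_prior, log_feat_lik, log_posts)
```

In this sketch the word string search would then run with the network of class `k_hat` together with the language model; that step is omitted here.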

The speech recognition device according to claim 1, wherein,
with the latent class distribution denoting the probability of occurrence of each latent class after the learning data has been observed,
the neural network parameter is obtained, using the latent class distribution, an input learning state variable sequence group, and a learning feature quantity sequence group, so as to maximize the sum of products of the probability of the correct state variable sequence and the latent class distribution,
the feature quantity generation distribution parameter is obtained, using the latent class distribution and input learning speech feature quantities, so as to maximize the sum of products of the probability that the feature quantities of the learning data are output and the latent class distribution, and
the latent class prior distribution parameter is obtained, using the latent class distribution, as the ratio of the latent class distribution of each latent class to the sum of the latent class distributions of all latent classes.

An acoustic model learning device comprising:
an acoustic model storage unit that stores a plurality of latent class prior distribution parameters indicating the likelihood of occurrence of latent classes, feature quantity generation distribution parameters that are parameters of feature quantity generation distributions, neural network parameters of neural network acoustic models, and a latent class distribution indicating the likelihood of occurrence of each latent class after the learning data has been observed;
a neural network learning unit that updates the neural network parameters using the latent class distribution, a learning state variable sequence group that is a group of state variables of learning speech data, and a learning feature quantity sequence group that is a group of feature quantities of the learning speech data;
a feature quantity generation distribution learning unit that updates the feature quantity generation distribution parameters using the latent class distribution and input learning speech feature quantities;
a latent class prior distribution learning unit that updates the latent class prior distribution parameters using the latent class distribution; and
a latent class distribution learning unit that updates the latent class distribution using the neural network parameters, the feature quantity generation distribution parameters, the latent class prior distribution parameters, an input learning state variable sequence, and the learning speech feature quantities,
wherein the processing in the neural network learning unit, the feature quantity generation distribution learning unit, the latent class prior distribution learning unit, and the latent class distribution learning unit is repeated until the updates of the neural network parameters, the feature quantity generation distribution parameters, the latent class prior distribution parameters, and the latent class distribution converge.
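The alternating updates recited in this claim resemble an EM-style loop, which the following Python sketch illustrates on toy data. Everything in it is an assumption made for illustration: the "network" is replaced by a one-parameter logistic model updated with a single gradient step per iteration, the feature quantity generation distribution is a unit-variance Gaussian with a learned mean, and the names `train`, `w`, `mu`, `prior`, and `q` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, S, K=2, iters=50, lr=0.1, tol=1e-6, seed=0):
    """X: (N, T) scalar features; S: (N, T) binary state labels (floats)."""
    rng = np.random.default_rng(seed)
    N, T = X.shape
    q = rng.dirichlet(np.ones(K), size=N)  # latent class distribution q[n, k]
    w = np.zeros(K)                        # stand-in "network" parameter per class
    mu = np.zeros(K)                       # feature generation mean per class
    prior = np.full(K, 1.0 / K)            # latent class prior
    for _ in range(iters):
        # Stand-in network update: one q-weighted gradient-ascent step per class.
        for k in range(K):
            p1 = sigmoid(w[k] * X)
            w[k] += lr * (q[:, k, None] * (S - p1) * X).sum() / X.size
        # Feature generation distribution update: q-weighted mean of the features.
        mu = (q * X.sum(axis=1, keepdims=True)).sum(axis=0) / (q.sum(axis=0) * T + 1e-12)
        # Prior update: share of each class in the total latent class distribution.
        prior = q.sum(axis=0) / q.sum()
        # Latent class distribution update: normalized product of the three terms.
        log_q = np.empty((N, K))
        for k in range(K):
            p1 = np.clip(sigmoid(w[k] * X), 1e-12, 1 - 1e-12)
            log_state = np.where(S == 1, np.log(p1), np.log(1 - p1)).sum(axis=1)
            log_feat = (-0.5 * (X - mu[k]) ** 2 - 0.5 * np.log(2 * np.pi)).sum(axis=1)
            log_q[:, k] = np.log(prior[k] + 1e-12) + log_state + log_feat
        log_q -= log_q.max(axis=1, keepdims=True)  # stabilize before exponentiating
        new_q = np.exp(log_q)
        new_q /= new_q.sum(axis=1, keepdims=True)
        converged = np.abs(new_q - q).max() < tol
        q = new_q
        if converged:
            break
    return w, mu, prior, q

# Toy data: 20 utterances of 10 frames drawn from two feature clusters (hypothetical).
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 1, (10, 10)), rng.normal(2, 1, (10, 10))])
S = (X > 0).astype(float)
w, mu, prior, q = train(X, S)
```

A single gradient step per iteration makes this a generalized EM variant; a real system would train each network more thoroughly in every pass and would use a richer feature distribution.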

The acoustic model learning device according to claim 3, wherein,
with k denoting a latent class index, n an utterance index, t a frame index, q_{n,k} the latent class distribution, s_t⁽ⁿ⁾ a state variable of the learning speech data, and x_t⁽ⁿ⁾ a feature quantity of the learning speech data,
the neural network learning unit updates the neural network parameter Λ_k according to the following equation:

the feature quantity generation distribution learning unit updates the feature quantity generation distribution parameter Θ_k according to the following equation:

the latent class prior distribution learning unit updates the latent class prior distribution parameter p_k according to the following equation:

and the latent class distribution learning unit updates the latent class distribution q_{n,k} according to the following equation:

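The equations referenced in this claim are not reproduced in this text. As a reading aid only, the verbal description in claim 2 is consistent with updates of the following general EM-style form. This is an assumed reconstruction, not the patent's equations: the use of log probabilities, the factorization over frames, and the exact normalization of q_{n,k} are all assumptions.

```latex
\begin{align}
\Lambda_k &\leftarrow \arg\max_{\Lambda} \sum_{n} q_{n,k} \sum_{t} \log P\bigl(s_t^{(n)} \mid x_t^{(n)}; \Lambda\bigr) \\
\Theta_k &\leftarrow \arg\max_{\Theta} \sum_{n} q_{n,k} \sum_{t} \log p\bigl(x_t^{(n)}; \Theta\bigr) \\
p_k &\leftarrow \frac{\sum_{n} q_{n,k}}{\sum_{k'} \sum_{n} q_{n,k'}} \\
q_{n,k} &\leftarrow \frac{p_k \prod_{t} p\bigl(x_t^{(n)}; \Theta_k\bigr)\, P\bigl(s_t^{(n)} \mid x_t^{(n)}; \Lambda_k\bigr)}
{\sum_{k'} p_{k'} \prod_{t} p\bigl(x_t^{(n)}; \Theta_{k'}\bigr)\, P\bigl(s_t^{(n)} \mid x_t^{(n)}; \Lambda_{k'}\bigr)}
\end{align}
```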

A speech recognition method in which a language model and a plurality of neural network acoustic models that differ for each latent class are stored in a storage unit,
a weight for each latent class is estimated from input speech data, and speech recognition is performed on the speech data based on the language model and the plurality of neural network acoustic models that differ for each latent class, wherein
the storage unit stores, as the neural network acoustic models, for each latent class, a latent class prior distribution parameter indicating the likelihood of occurrence of the latent class, a feature quantity generation distribution parameter that is a parameter of a feature quantity generation distribution, and a neural network parameter of the neural network acoustic model,
the method comprising:
an optimal state variable sequence estimation step of estimating, for each latent class, an optimal state variable sequence for the speech data using the latent class prior distribution parameter, the neural network parameter, and the feature quantity generation distribution parameter;
an optimal latent class estimation step of selecting an optimal latent class for the speech data using the latent class prior distribution parameter, the neural network parameter, the feature quantity generation distribution parameter, and the optimal state variable sequence; and
a word string search step of searching for a word string for the speech data using the neural network parameter corresponding to the optimal latent class and the language model.

The speech recognition method according to claim 5, wherein,
with the latent class distribution denoting the probability of occurrence of each latent class after the learning data has been observed,
the neural network parameter is obtained, using the latent class distribution, an input learning state variable sequence group, and a learning feature quantity sequence group, so as to maximize the sum of products of the probability of the correct state variable sequence and the latent class distribution,
the feature quantity generation distribution parameter is obtained, using the latent class distribution and input learning speech feature quantities, so as to maximize the sum of products of the probability that the feature quantities of the learning data are output and the latent class distribution, and
the latent class prior distribution parameter is obtained, using the latent class distribution, as the ratio of the latent class distribution of each latent class to the sum of the latent class distributions of all latent classes.

An acoustic model learning method in which an acoustic model storage unit stores a plurality of latent class prior distribution parameters indicating the likelihood of occurrence of latent classes, feature quantity generation distribution parameters that are parameters of feature quantity generation distributions, neural network parameters of neural network acoustic models, and a latent class distribution indicating the likelihood of occurrence of each latent class after the learning data has been observed,
the method comprising:
a neural network learning step of updating the neural network parameters using the latent class distribution, a learning state variable sequence group that is a group of state variables of learning speech data, and a learning feature quantity sequence group that is a group of feature quantities of the learning speech data;
a feature quantity generation distribution learning step of updating the feature quantity generation distribution parameters using the latent class distribution and input learning speech feature quantities;
a latent class prior distribution learning step of updating the latent class prior distribution parameters using the latent class distribution; and
a latent class distribution learning step of updating the latent class distribution using the neural network parameters, the feature quantity generation distribution parameters, the latent class prior distribution parameters, an input learning state variable sequence, and the learning speech feature quantities,
wherein the processing in the neural network learning step, the feature quantity generation distribution learning step, the latent class prior distribution learning step, and the latent class distribution learning step is repeated until the updates of the neural network parameters, the feature quantity generation distribution parameters, the latent class prior distribution parameters, and the latent class distribution converge.

The acoustic model learning method according to claim 7, wherein,
with k denoting a latent class index, n an utterance index, t a frame index, q_{n,k} the latent class distribution, s_t⁽ⁿ⁾ a state variable of the learning speech data, and x_t⁽ⁿ⁾ a feature quantity of the learning speech data,
the neural network learning step updates the neural network parameter Λ_k according to the following equation:

the feature quantity generation distribution learning step updates the feature quantity generation distribution parameter Θ_k according to the following equation:

the latent class prior distribution learning step updates the latent class prior distribution parameter p_k according to the following equation:

and the latent class distribution learning step updates the latent class distribution q_{n,k} according to the following equation:


A program for causing a computer to function as the speech recognition device according to claim 1 or 2.

A program for causing a computer to function as the acoustic model learning device according to claim 3 or 4.