JP6000094B2 - Speaker adaptation device, speaker adaptation method, and program

Info

Publication number: JP6000094B2
Application number: JP2012264067A
Authority: JP
Inventors: 拓也吉岡; 中谷　智広; 智広中谷
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2012-12-03
Filing date: 2012-12-03
Publication date: 2016-09-28
Anticipated expiration: 2032-12-03
Also published as: JP2014109698A

Description

The present invention relates to a speaker adaptation device, a speaker adaptation method, and a program for a speech recognition device that takes as input a speech signal distorted by noise, reverberation, the communication channel, or the like.

A conventional feature correction technique is briefly described below with reference to the speech recognition device shown in FIG. 1. FIG. 1 is a block diagram showing the configuration of a conventional speech recognition device that includes a feature correction process.

As shown in FIG. 1, the conventional speech recognition device 9 includes a feature extraction unit 91, a feature correction unit 92, a correction acoustic model storage unit 93, a feature conversion unit 94, a speech recognition decoder unit 95, a recognition acoustic model storage unit 96, a language model storage unit 97, and a pronunciation dictionary storage unit 98. The correction acoustic model storage unit 93 stores an acoustic model dedicated to correcting acoustic features. The recognition acoustic model storage unit 96 stores an acoustic model dedicated to speech recognition. The language model storage unit 97 stores a language model, and the pronunciation dictionary storage unit 98 stores a pronunciation dictionary. The speech recognition device 9 receives as input a speech signal distorted by noise, reverberation, the communication channel, or the like, and outputs a word or word sequence representing the utterance content, that is, a speech recognition result. Hereinafter, distorted speech is called degraded speech, its signal a degraded signal, and its acoustic feature a degraded feature.

The degraded signal is input to the feature extraction unit 91, which divides it into short-time frames, converts the degraded signal of each frame into an acoustic feature, and outputs the time series of these features. Hereinafter, a time series of acoustic features is called an acoustic feature sequence, or simply a feature sequence. Log mel-frequency spectral coefficients, mel-frequency cepstral coefficients (MFCCs), and the like are used as acoustic features. The feature sequence of degraded speech is called a degraded feature sequence. The degraded feature sequence is input to the feature correction unit 92, which uses the acoustic model stored in the correction acoustic model storage unit 93 to correct the distortion superimposed on the degraded feature of each short-time frame and outputs the time series of corrected acoustic features. The feature correction unit 92 is built from a feature enhancement method such as VTS (Vector Taylor Series) enhancement (Non-Patent Document 1) or Algonquin (Non-Patent Document 2). Hereinafter, an acoustic feature whose distortion has been corrected is called a corrected feature, and the time series of corrected features a corrected feature sequence.

The corrected feature sequence output from the feature correction unit 92 is input to the feature conversion unit 94, which converts it into a time series of the feature representation used by the speech recognition decoder unit 95 and outputs the result. Hereinafter, the feature representation used by the speech recognition decoder unit 95 is called a recognition feature, and its time series a recognition feature sequence. A typical recognition feature is MFCCs concatenated with delta cepstra. The recognition feature sequence output from the feature conversion unit 94 is input to the speech recognition decoder unit 95, which refers to the acoustic model dedicated to speech recognition, the language model, the pronunciation dictionary, and so on, computes the word or word sequence that best matches the input recognition feature sequence, and outputs it as the speech recognition result. In the figure, the components drawn as cylinders (93, 96, 97, 98) are storage units that hold the parameters defining the respective models; these parameters are read by the processing units that refer to the models.
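The framing done by the feature extraction unit 91 and the delta concatenation done by the feature conversion unit 94 can be illustrated with a minimal sketch. The function names, the frame length and hop, and the two-point delta formula are assumptions chosen for illustration; the text does not specify them, and the conversion from a frame to a log mel or MFCC vector is omitted.

```python
def split_into_frames(signal, frame_len, hop):
    """Split a sample sequence into overlapping short-time frames."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def append_deltas(features):
    """Append a simple delta (central difference) to each static feature vector.

    Edge frames reuse their own value in place of the missing neighbor.
    """
    out = []
    last = len(features) - 1
    for n, vec in enumerate(features):
        prev = features[max(n - 1, 0)]
        nxt = features[min(n + 1, last)]
        delta = [(b - a) / 2.0 for a, b in zip(prev, nxt)]
        out.append(list(vec) + delta)
    return out


# Toy data: ten samples, frames of four samples with a hop of two.
frames = split_into_frames(list(range(10)), frame_len=4, hop=2)

# Toy one-dimensional "static features" for three frames.
recognition_features = append_deltas([[0.0], [2.0], [6.0]])
```

Each element of `recognition_features` is the static vector followed by its delta, which mirrors the "MFCCs concatenated with delta cepstra" representation mentioned above.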

The description above presents the feature correction technique in the context of a speech recognition device, but the technique is not limited to that use; it can also be applied to speech recognition programs, noise suppression devices, noise suppression programs, and the like.

(Non-Patent Document 1) P. J. Moreno, B. Raj, and R. M. Stern, “A vector Taylor series approach for environment-independent speech recognition,” in Proc. Int. Conf. Acoust., Speech, Signal Process., vol. 2, 1996, pp. 733-736.
(Non-Patent Document 2) B. J. Frey, L. Deng, A. Acero, and T. Kristjansson, “ALGONQUIN: iterating Laplace’s method to remove multiple types of acoustic distortion for robust speech recognition,” in Proc. Eurospeech, 2001, pp. 901-904.

As described above, the feature correction unit 92 uses an acoustic model for correcting the influence of distortion that is separate from the one used by the speech recognition decoder unit 95. Hereinafter, this model is called the correction acoustic model. The acoustic model dedicated to speech recognition that the speech recognition decoder unit 95 refers to is called the recognition acoustic model, to distinguish it from the correction acoustic model. The correction acoustic model is expressed as a Gaussian mixture (mixed normal distribution) and has a simpler structure than the recognition acoustic model, which is based on hidden Markov models. It represents the distribution of acoustic features in short-time frames of clean, distortion-free speech, and is trained in advance on a speech corpus.

To make a speech recognition device more reliable, it is desirable to train the correction acoustic model on a large amount of pre-recorded speech from the user. Such a correction acoustic model is called a speaker-dependent (specific speaker) model. In practice, however, collecting a large amount of the user's speech in advance is difficult, so in most cases the correction acoustic model is trained on the speech of one or more unspecified speakers. Such a correction acoustic model is called a speaker-independent (unspecified speaker) model. A speaker-independent model does provide some feature correction effect, but the speech recognition accuracy obtained with it is lower than that obtained with a speaker-dependent model.

Meanwhile, speaker adaptation techniques are known that use a small amount of utterance data from the user of a speech recognition device to modify a speaker-independent “recognition” acoustic model so that it matches the characteristics of that user's speech. Specifically, a speaker adaptation technique receives as input a set of acoustic features extracted from the small amount of utterance data and the set of parameter values of the recognition acoustic model that define the speaker-independent model, modifies the parameter values to match the user's acoustic characteristics, and outputs the set of modified parameter values. Hereinafter, the parameter values that define the correction acoustic model before speaker adaptation, that is, the speaker-independent model, are called pre-adaptation parameter values, and the speaker-adapted parameter values are called post-adaptation parameter values.

However, conventional speaker adaptation techniques for the “recognition” acoustic model cannot be applied as they are to the “correction” acoustic model. One way to collect the user's speech in advance, for example, is to use speech recorded when the user operated the speech recognition device in the past. Such speech is distorted, however, so the acoustic features available for speaker adaptation are degraded features. Because the correction acoustic model represents clean acoustic features, adapting it with these degraded features does not improve speech recognition accuracy. An object of the present invention is therefore to provide a speaker adaptation device that yields a correction acoustic model adapted to the speaker appropriately enough to improve speech recognition accuracy effectively.

The speaker adaptation device of the present invention includes a frame selection unit and a parameter correction unit. The acoustic feature of distorted speech is called a degraded feature. The acoustic model used to correct the influence of distortion is called the correction acoustic model. A correction acoustic model trained on the speech of unspecified speakers is called a speaker-independent model. The frame selection unit receives a set of degraded features recorded in advance, extracts from it the degraded features whose degree of degradation is small, and outputs the extracted set as the set of adaptation features. The parameter correction unit receives the set of adaptation features and the set of pre-adaptation parameter values, which are the parameter values defining the speaker-independent model, modifies the pre-adaptation parameter values to fit the adaptation features (speaker adaptation), and outputs the set of post-adaptation parameter values, which are the speaker-adapted parameter values.

According to the speaker adaptation device of the present invention, an appropriately speaker-adapted correction acoustic model can be obtained.

FIG. 1 is a block diagram showing the configuration of a conventional speech recognition device that includes a feature correction process.
FIG. 2 is a block diagram showing the configuration of the speaker adaptation device according to Embodiment 1 of the present invention.
FIG. 3 is a flowchart showing the operation of the speaker adaptation device according to Embodiment 1 of the present invention.
FIG. 4 is a block diagram showing a configuration example of the frame selection unit of Embodiment 1.
FIG. 5 is a flowchart showing an operation example of the frame selection unit of Embodiment 1.
FIG. 6 is a block diagram showing a configuration example of the parameter correction unit of Embodiment 1.
FIG. 7 is a flowchart showing an operation example of the parameter correction unit of Embodiment 1.
FIG. 8 is a block diagram showing the configuration of the speaker adaptation device according to Embodiment 2 of the present invention.
FIG. 9 is a flowchart showing the operation of the speaker adaptation device according to Embodiment 2 of the present invention.
FIG. 10 is a block diagram showing a configuration example of the frame selection unit of Embodiment 2.
FIG. 11 is a flowchart showing an operation example of the frame selection unit of Embodiment 2.
FIG. 12 is a block diagram showing a configuration example of the reverberation time estimation means of Embodiment 2.
FIG. 13 is a flowchart showing an operation example of the reverberation time estimation means of Embodiment 2.
FIG. 14 is a block diagram showing a configuration in which the speaker adaptation device is combined with a speech recognition device.
FIG. 15 is a block diagram showing the configuration of a computer that implements the speaker adaptation device of the present invention.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and duplicate description is omitted. In the following, $I$, $N_i$, $K$, and $N$ are integers greater than or equal to 1, and $i$, $k$, and $n$ are integers.

A speaker adaptation device according to Embodiment 1 of the present invention is described below with reference to FIGS. 2 and 3. FIG. 2 is a block diagram showing the configuration of the speaker adaptation device according to Embodiment 1 of the present invention. FIG. 3 is a flowchart showing the operation of the speaker adaptation device according to Embodiment 1 of the present invention.

The input to the speaker adaptation device 1 consists of the set of degraded features, that is, acoustic features ($D$-dimensional vectors) obtained from short-time frames of pre-recorded degraded speech signals of the user, $Y = \{y_i(n);\ 1 \le i \le I,\ 1 \le n \le N_i\}$, and the set $\Lambda$ of pre-adaptation parameter values that define the speaker-independent model. Here $I$ is the number of utterances in $Y$, and each $N_i$ is the number of short-time frames in the $i$-th utterance. The present invention assumes that the correction acoustic model is expressed as a Gaussian mixture. That is, $\Lambda$ consists of a weight $\omega_k$, a mean vector $m_k$, and a precision matrix $R_k$ for each Gaussian ($1 \le k \le K$), and can be written $\Lambda = \{\omega_k, m_k, R_k;\ 1 \le k \le K\}$, where $K$ is the number of Gaussians in the mixture. The output of the speaker adaptation device is the set $\Theta$ of post-adaptation parameter values that define the correction acoustic model modified by the speaker adaptation process. $\Theta$ consists of the modified weights $\tilde{\omega}_k$, the modified mean vectors $\tilde{m}_k$, and the modified precision matrices $\tilde{R}_k$ ($1 \le k \le K$), and can be written $\Theta = \{\tilde{\omega}_k, \tilde{m}_k, \tilde{R}_k;\ 1 \le k \le K\}$.
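To make the role of $\Lambda$ concrete, the following minimal sketch evaluates the log-likelihood of one feature vector under a Gaussian mixture defined by weights, mean vectors, and precisions. It assumes diagonal precision matrices, stored as lists of per-dimension values, purely to keep the example short; the text itself allows full matrices $R_k$. The function name is hypothetical.

```python
import math


def gmm_log_likelihood(x, weights, means, precisions):
    """Return log p(x) for a Gaussian mixture with diagonal precisions.

    weights[k]    : mixture weight of component k
    means[k]      : mean vector of component k
    precisions[k] : diagonal of the precision matrix of component k (assumption)
    """
    log_terms = []
    for w, m, r in zip(weights, means, precisions):
        # log of the Gaussian normalizer: 0.5 * sum_d log(r_d / (2*pi))
        log_norm = 0.5 * sum(math.log(rd / (2.0 * math.pi)) for rd in r)
        # quadratic form 0.5 * (x - m)^T R (x - m) for a diagonal R
        quad = 0.5 * sum(rd * (xd - md) ** 2 for xd, md, rd in zip(x, m, r))
        log_terms.append(math.log(w) + log_norm - quad)
    # log-sum-exp over components for numerical stability
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


# A one-component, one-dimensional standard normal evaluated at its mean.
value = gmm_log_likelihood([0.0], [1.0], [[0.0]], [[1.0]])
```

Working with precisions rather than covariances matches the parameterization $\{\omega_k, m_k, R_k\}$ used in the text and avoids a matrix inversion when evaluating the density.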

As shown in FIG. 2, the speaker adaptation device 1 of this embodiment includes a frame selection unit 11 and a parameter correction unit 12. The set $Y$ of degraded features input to the speaker adaptation device 1 is passed to the frame selection unit 11. The frame selection unit 11 selects from $Y$ a set of acoustic features $X = \{x_n;\ 1 \le n \le N\}$ that have a high SN ratio and can be used for speaker adaptation, and outputs it (S11, frame selection step). Since $X$ is a subset of $Y$, $X \subseteq Y$ and $N \le N_1 + \cdots + N_I$ hold. Specifically, the frame selection unit 11 calculates, for each $y_i(n)$ ($1 \le i \le I$, $1 \le n \le N_i$), the SN ratio or the value of an index correlated with the SN ratio. Only when the calculated value is larger than a predetermined threshold does it judge that $y_i(n)$ is degraded little enough to be used for speaker adaptation of the correction acoustic model, and include $y_i(n)$ in $X$. The acoustic features in $X$ are called adaptation features.

The set $X$ of adaptation features output from the frame selection unit 11 is input to the parameter correction unit 12 together with the set $\Lambda$ of pre-adaptation parameter values. Using any one of the speaker adaptation methods designed for recognition acoustic models, the parameter correction unit 12 adapts the parameter values in $\Lambda$ so that they fit the adaptation features in $X$, and thereby calculates the set $\Theta$ of post-adaptation parameter values (S12, parameter correction step). The $\Theta$ obtained in this way is output from the speaker adaptation device 1. Known speaker adaptation methods that can be used in the parameter correction unit 12 include maximum a posteriori (MAP) adaptation (Reference Non-Patent Document 1) and maximum likelihood linear regression (Reference Non-Patent Document 2).
(Reference Non-Patent Document 1) J. Gauvain and C.-H. Lee, “Maximum a posteriori estimation for multivariate Gaussian mixture observation of Markov chains,” IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 291-298, 1994.
(Reference Non-Patent Document 2) C. Leggetter and P. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Comput. Speech Language, vol. 9, no. 2, pp. 171-185, 1995.

<Frame selection unit 11: configuration example assuming an additive noise environment>
A configuration example of the frame selection unit 11 is described below with reference to FIGS. 4 and 5. FIG. 4 is a block diagram showing a configuration example of the frame selection unit 11 of this embodiment. FIG. 5 is a flowchart showing an operation example of the frame selection unit 11 of this embodiment. As shown in FIG. 4, the frame selection unit 11 of this embodiment includes, for example, a noise estimation means 111, an SN ratio estimation means 112, and a first threshold processing means 113. The configuration example shown in FIG. 4 is intended for use in an additive noise environment; it does not use the set $\Lambda$ of pre-adaptation parameter values and receives only the set $Y$ of degraded features as input.

The noise estimation means 111 receives the set $Y$ of degraded features as input, estimates the noise feature $d_i(n)$, which is the acoustic feature of the noise contained in $Y$, using a known noise estimation method, and outputs the set of calculated noise features $\{d_i(n);\ 1 \le i \le I,\ 1 \le n \le N_i\}$ (SS111, noise estimation sub-step). Known noise estimation methods include a method based on voice activity detection (Reference Non-Patent Document 3), a method that assumes the first few frames of each degraded utterance consist only of noise (Reference Non-Patent Document 4), and the IMCRA method (Reference Non-Patent Document 5).
(Reference Non-Patent Document 3) S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113-120, 1979.
(Reference Non-Patent Document 4) Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, pp. 1109-1121, 1984.
(Reference Non-Patent Document 5) I. Cohen, “Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging,” IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466-475, 2003.

For example, in the second method listed above, $d_i(n)$ is calculated according to the following equation.
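The equation itself does not survive in this text. One plausible reading, stated here as an assumption, is that the noise feature of utterance $i$ is the average of its first $D_i$ frames, $d_i(n) = \frac{1}{D_i} \sum_{n'=1}^{D_i} y_i(n')$, and is therefore the same for every frame $n$ of that utterance. Whether the original averages in the feature domain or in the linear spectral domain cannot be recovered from the text; the sketch below averages the feature vectors directly, and the function name is hypothetical.

```python
def estimate_noise(utterance, num_noise_frames):
    """Estimate a noise feature for every frame of one utterance.

    utterance        : list of feature vectors y_i(1), ..., y_i(N_i)
    num_noise_frames : D_i, the number of leading frames assumed to be noise only

    Assumption: the estimate is the feature-domain mean of the leading frames,
    reused unchanged for all frames of the utterance.
    """
    head = utterance[:num_noise_frames]
    dim = len(utterance[0])
    mean = [sum(vec[d] for vec in head) / len(head) for d in range(dim)]
    return [list(mean) for _ in utterance]


# Toy utterance of three two-dimensional frames; the first two are treated as noise.
noise = estimate_noise([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]], num_noise_frames=2)
```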

Here each $D_i$ is the number of frames at the beginning of the utterance that are assumed to consist only of noise. The SN ratio estimation means 112 receives the set $Y$ of degraded features and the set of noise features $\{d_i(n);\ 1 \le i \le I,\ 1 \le n \le N_i\}$ as input, calculates for each acoustic feature $y_i(n)$ the scalar value $r_i(n)$ defined by the following equation, and outputs the set of calculated SN ratios $\{r_i(n);\ 1 \le i \le I,\ 1 \le n \le N_i\}$ (SS112, SN ratio estimation sub-step).

Here $\exp(\cdot)$ and $|\cdot|$ denote the exponential function and the vector norm, respectively. This value is an estimate of the SN ratio and is hereinafter simply called the SN ratio. The first threshold processing means 113 receives the set $Y$ of degraded features and the set of SN ratios $\{r_i(n);\ 1 \le i \le I,\ 1 \le n \le N_i\}$ as input, obtains the set $X = \{y_i(n);\ r_i(n) > H,\ 1 \le i \le I,\ 1 \le n \le N_i\}$ of acoustic features $y_i(n)$ whose SN ratio $r_i(n)$ exceeds a predetermined threshold $H$, and outputs it as the set of adaptation features (SS113, first threshold processing sub-step).
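The SN ratio estimation and thresholding sub-steps can be sketched as follows. The defining equation for $r_i(n)$ is missing from this text, so the ratio used here, $|\exp(y_i(n))| / |\exp(d_i(n))|$ with an element-wise exponential and the Euclidean norm, is only an assumption consistent with the symbols the text explains; it also presumes log-spectral features. The function names and the toy threshold are hypothetical.

```python
import math


def snr_estimate(y, d):
    """Assumed SN ratio index: ratio of Euclidean norms of exp(y) and exp(d)."""
    num = math.sqrt(sum(math.exp(v) ** 2 for v in y))
    den = math.sqrt(sum(math.exp(v) ** 2 for v in d))
    return num / den


def select_frames(Y, noise, threshold):
    """Collect every frame whose SN ratio index exceeds the threshold H.

    Y and noise are lists of utterances; each utterance is a list of
    feature vectors, and noise[i][n] is the noise feature for Y[i][n].
    """
    X = []
    for utterance, utterance_noise in zip(Y, noise):
        for y, d in zip(utterance, utterance_noise):
            if snr_estimate(y, d) > threshold:
                X.append(y)
    return X


# One utterance with a noise-like frame and a louder frame.
Y = [[[0.0, 0.0], [2.0, 2.0]]]
noise = [[[0.0, 0.0], [0.0, 0.0]]]
X = select_frames(Y, noise, threshold=2.0)
```

Only the second frame is kept in this toy case, because its index is about 7.4 while the first frame's index is 1. The selected frames are pooled across utterances, matching the flat indexing of $X$ by $n$ alone.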

<Parameter correction unit 12: configuration example using maximum a posteriori adaptation>
A configuration example of the parameter correction unit 12 is described below with reference to FIGS. 6 and 7. FIG. 6 is a block diagram showing a configuration example of the parameter correction unit 12 of this embodiment. FIG. 7 is a flowchart showing an operation example of the parameter correction unit 12 of this embodiment. As shown in FIG. 6, the parameter correction unit 12 of this embodiment includes, for example, a parameter initialization means 121, a hyperparameter setting means 122, a count initialization means 123, a distribution rate calculation means 124, a parameter update means 125, a count increment means 126, and a convergence determination means 127.

The configuration example of the parameter correction unit 12 shown in FIGS. 6 and 7 uses maximum a posteriori (MAP) adaptation. The details of this processing, including the underlying ideas, are explained in Reference Non-Patent Document 1, so this specification describes only the processing flow. MAP adaptation is presented merely as one concrete example of a speaker adaptation technique; the parameter correction unit 12 can also be implemented with other speaker adaptation techniques such as maximum likelihood linear regression.

The parameter initialization means 121 initializes the modified parameters $\tilde{\omega}_k$, $\tilde{m}_k$, and $\tilde{R}_k$ ($1 \le k \le K$) (SS121, parameter initialization sub-step). One possible initialization uses the parameters of the speaker-independent model, as in the following equations: $\tilde{\omega}_k = \omega_k$, $\tilde{m}_k = m_k$, $\tilde{R}_k = R_k$ ($1 \le k \le K$).

However, the initialization method is not limited to the above equations, and the modified parameters may be initialized by any other method. Next, the hyperparameter setting means 122 sets the values of the hyperparameters of the prior distributions used for MAP adaptation (SS122, hyperparameter setting sub-step). A Dirichlet distribution is used as the prior for the weights, and a normal-Wishart distribution is used as the joint prior for each mean vector and precision matrix. That is, the prior $g(\omega_1, \ldots, \omega_K)$ for the weights and the joint prior $g(m_k, R_k)$ for the mean vector and precision matrix are given by the following equations, respectively.
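The two equations are not reproduced in this text. The standard forms from Reference Non-Patent Document 1, written with the hyperparameters $(\nu_k, \tau_k, \mu_k, \alpha_k, U_k)$ and the trace operator that the text goes on to explain, are given below as a likely reconstruction rather than a verbatim copy. Normalizing constants are omitted, and $D$ is the feature dimension.

```latex
g(\omega_1, \ldots, \omega_K) \propto \prod_{k=1}^{K} \omega_k^{\nu_k - 1}

g(m_k, R_k) \propto |R_k|^{(\alpha_k - D)/2}
  \exp\!\left( -\frac{\tau_k}{2} (m_k - \mu_k)^{\mathsf{T}} R_k (m_k - \mu_k) \right)
  \exp\!\left( -\frac{1}{2} \operatorname{tr}(U_k R_k) \right)
```

In these forms $\mu_k$ acts as a prior mean and $\tau_k$ controls how strongly the adapted mean is pulled toward it.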

Here $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, and $(\nu_k, \tau_k, \mu_k, \alpha_k, U_k;\ 1 \le k \le K)$ are the hyperparameters that define the prior distributions. These values are set, for example, as follows.

However, the hyperparameter values are not limited to these and may be set freely. MAP adaptation is based on the EM algorithm and involves iteration. Before the iteration starts, the count initialization means 123 sets the iteration count $C$ to 1 (SS123, count initialization sub-step). Next, the distribution rate calculation means 124 calculates, for $1 \le k \le K$ and $1 \le n \le N$, the distribution rate $c_{kn}$ defined by the following equation (SS124, distribution rate calculation sub-step).
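The defining equation is missing here. In the EM algorithm for a Gaussian mixture this quantity is normally the posterior probability that component $k$ generated $x_n$ under the current parameters, $c_{kn} = \tilde{\omega}_k \, \mathcal{N}(x_n; \tilde{m}_k, \tilde{R}_k^{-1}) \,/\, \sum_{k'=1}^{K} \tilde{\omega}_{k'} \, \mathcal{N}(x_n; \tilde{m}_{k'}, \tilde{R}_{k'}^{-1})$, and the sketch below assumes that form. It further assumes diagonal precision matrices to stay short, and the function name is hypothetical.

```python
import math


def distribution_rates(X, weights, means, precisions):
    """Return c[k][n], the posterior probability of component k given x_n.

    Assumes diagonal precisions: precisions[k] lists the diagonal of R_k.
    """
    K = len(weights)
    c = [[0.0] * len(X) for _ in range(K)]
    for n, x in enumerate(X):
        logs = []
        for w, m, r in zip(weights, means, precisions):
            log_norm = 0.5 * sum(math.log(rd / (2.0 * math.pi)) for rd in r)
            quad = 0.5 * sum(rd * (xd - md) ** 2 for xd, md, rd in zip(x, m, r))
            logs.append(math.log(w) + log_norm - quad)
        # Normalize in the log domain to avoid underflow.
        top = max(logs)
        probs = [math.exp(v - top) for v in logs]
        total = sum(probs)
        for k in range(K):
            c[k][n] = probs[k] / total
    return c


# Two unit-variance components at 0 and 10; the frames sit at 0, 5, and 10.
c = distribution_rates([[0.0], [5.0], [10.0]],
                       [0.5, 0.5], [[0.0], [10.0]], [[1.0], [1.0]])
```

For each frame the rates sum to one over the components, so the frame at the midpoint is shared equally while the other two are assigned almost entirely to the nearer component.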

Next, the parameter update means 125 updates the modified parameters according to the following equations (SS125, parameter update substep).

However, the parameter update substep need not update all of the modified parameters. For example, only the mean vectors modified by equation (15) may be updated. The count increment means 126 computes C = C + 1 to increase the iteration count by 1 (SS126, count increment substep). The convergence determination means 127 determines whether the EM algorithm has converged; if it has not (SS127 NO), processing returns to the distribution rate calculation substep (SS124), and if it has (SS127 YES), the flow ends. Convergence can be judged, for example, by whether the iteration count exceeds a threshold C_max. However, the convergence condition is not limited to this, and convergence may instead be judged from, for example, the amount by which the parameters changed in an iteration.
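The loop formed by substeps SS123 to SS127 can be sketched as follows for the mean-only variant mentioned above. This is a minimal one-dimensional sketch, not the implementation of this publication: it assumes the textbook MAP mean update m_k = (τ μ_k + Σ_n c_kn x_n) / (τ + Σ_n c_kn) with a single τ shared by all components, and the function and argument names are hypothetical.

```python
import math

def map_adapt_means(X, weights, prior_means, precs, tau, c_max):
    """MAP-adapt the means of a 1-D Gaussian mixture; weights and precisions stay fixed."""
    means = list(prior_means)      # modified parameters start from the prior means
    count = 1                      # SS123: C = 1
    while True:
        # SS124: distribution rates, stored here as c[n][k]
        c = []
        for x in X:
            lik = [w * math.sqrt(r / (2.0 * math.pi)) * math.exp(-0.5 * r * (x - m) ** 2)
                   for w, m, r in zip(weights, means, precs)]
            total = sum(lik)
            c.append([v / total for v in lik])
        # SS125: update the mean vectors only (assumed textbook MAP update)
        for k in range(len(means)):
            occupancy = sum(c[n][k] for n in range(len(X)))
            weighted_sum = sum(c[n][k] * X[n] for n in range(len(X)))
            means[k] = (tau * prior_means[k] + weighted_sum) / (tau + occupancy)
        count += 1                 # SS126: C = C + 1
        if count > c_max:          # SS127: count-based convergence test
            return means
```

With this count-based test the loop runs exactly C_max iterations. A large τ keeps the adapted means close to the speaker-independent values, while a small τ lets the adaptation data dominate.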

Hereinafter, the speaker adaptation device according to the second embodiment will be described with reference to FIGS. 8 to 11. FIG. 8 is a block diagram showing the configuration of the speaker adaptation device 2 according to the second embodiment of the present invention. FIG. 9 is a flowchart showing the operation of the speaker adaptation device 2 according to the second embodiment of the present invention. FIG. 10 is a block diagram showing a configuration example of the frame selection unit 21 of this embodiment. FIG. 11 is a flowchart showing an operation example of the frame selection unit 21 of this embodiment.

As shown in FIG. 8, the speaker adaptation device 2 of this embodiment includes a frame selection unit 21 and a parameter modification unit 12. The parameter modification unit 12 is identical to that of the first embodiment, so its description is omitted. The frame selection unit 21, which is the point of difference from the first embodiment, is described below.

<Frame selection unit 21: configuration example assuming a reverberant environment>
The frame selection unit 21 of this embodiment is intended for use in reverberant environments and receives as input the set Λ of pre-adaptation parameter values and the set Y of degraded features. From Y, the frame selection unit 21 selects and outputs the set X = {x_n; 1 ≤ n ≤ N} of acoustic features that have a short reverberation time and can be used for speaker adaptation (S21, frame selection step). Specifically, the frame selection unit 21 estimates the reverberation time of each utterance i (1 ≤ i ≤ I) from the input set Y of degraded features and the set Λ of pre-adaptation parameter values. Only when the estimated reverberation time is smaller than a predetermined threshold does it judge y_i(n) to be a degraded feature with a small degree of degradation that can be used for speaker adaptation of the correction acoustic model, and include y_i(n) in X.

More specifically, as shown in FIG. 10, the frame selection unit 21 includes a reverberation time estimation means 211 and a second threshold processing means 212. The reverberation time estimation means 211 receives the set Λ of pre-adaptation parameter values and the set Y of degraded features as input, estimates the reverberation time T₆₀(i) from the degraded features {y_i(1), ..., y_i(N_i)} of each utterance using a known reverberation time estimation method, and outputs the set of estimated reverberation times {T₆₀(i); 1 ≤ i ≤ I} (SS211, reverberation time estimation substep). The reverberation time can be estimated, for example, by using the reverberation time estimation method described in Reference Non-Patent Document 6 or by adapting the reverberation correction method described in Reference Non-Patent Document 7.
(Reference Non-Patent Document 6) R. Ratnam, D. L. Jones, B. C. Wheeler, W. D. O'Brien Jr., C. R. Lansing, and A. S. Feng, "Blind estimation of reverberation time," J. Acoustical Society of America, vol. 114, no. 5, pp. 2877-2892, 2003.
(Reference Non-Patent Document 7) Takuya Yoshioka and Tomohiro Nakatani, "高即応・高精度な歪み特徴量モデルの推定のための動的静的アプローチ" (A dynamic-static approach for fast-responding, accurate estimation of distorted feature models), vol. 2011-SLP-89, no. 22, 2011.

Alternatively, instead of estimating the reverberation time automatically, a person may listen to the speech and judge the degree of reverberation. Because the distortion superimposed on the speech grows as the reverberation time increases, the reverberation time can be regarded as an index inversely correlated with the SN ratio. Accordingly, the second threshold processing means 212 receives the set Y of degraded features and the set of reverberation times {T₆₀(i); 1 ≤ i ≤ I} as input, obtains the set X = {y_i(n); T₆₀(i) < R, 1 ≤ i ≤ I, 1 ≤ n ≤ N_i} of acoustic features y_i(n) whose corresponding reverberation time T₆₀(i) is smaller than a predetermined threshold R, and outputs the resulting set X of adaptation features (SS212, second threshold processing substep).
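The second threshold processing substep amounts to a per-utterance filter. The sketch below shows one way to express it, assuming the features are held as a list of utterances, each a list of frame vectors, with one reverberation time estimate per utterance. The function name and data layout are hypothetical.

```python
def select_adaptation_features(Y, t60, R):
    """Collect all frames of every utterance whose estimated T60 is below R.

    Y[i]   : list of feature vectors y_i(1), ..., y_i(N_i) of utterance i
    t60[i] : estimated reverberation time of utterance i
    R      : predetermined threshold
    """
    X = []
    for frames, t in zip(Y, t60):
        if t < R:  # strict inequality, matching T60(i) < R
            X.extend(frames)
    return X
```

Selection is made per utterance rather than per frame, because a single reverberation time is estimated for each utterance.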

<Reverberation time estimation means 211: configuration example adapting the reverberation correction method of Reference Non-Patent Document 7>
Hereinafter, a configuration example of the reverberation time estimation means 211 will be described with reference to FIGS. 12 and 13. FIG. 12 is a block diagram showing a configuration example of the reverberation time estimation means 211 of this embodiment. FIG. 13 is a flowchart showing an operation example of the reverberation time estimation means 211 of this embodiment.

This reverberation time estimation means 211 is a configuration example that estimates the reverberation time T₆₀(i) by adapting the reverberation correction method of Reference Non-Patent Document 7. That method estimates three kinds of parameters, called the attenuation rate b_i, the distortion variance σ_i, and the shift h_i, each of which has the same dimension as the acoustic feature y_i(n). Because the attenuation rate is closely related to the reverberation time, the reverberation time T₆₀(i) is calculated from the attenuation rate b_i obtained by the method of Reference Non-Patent Document 7. In what follows, the utterance index i, which is an integer of 1 or more, is omitted to simplify the notation. It is also assumed that log mel filter bank outputs are used as the acoustic features and that the precision matrices of the correction acoustic model are diagonal, that is, R_k = diag(r_k) (1 ≤ k ≤ K), where each r_k is the vector of diagonal elements of the precision matrix and diag(·) denotes the operation that converts a vector into a diagonal matrix.

As shown in FIG. 12, the reverberation time estimation means 211 includes, for example, a variable initialization unit 2110, a count initialization unit 2111, a synthesis unit 2112, a coefficient calculation unit 2113, a processing branch unit 2114, a first update unit 2115, a second update unit 2116, a count increment unit 2117, a convergence determination unit 2118, and an attenuation rate conversion unit 2119.

The variable initialization unit 2110 initializes the unknown variables, namely the attenuation rate b, the distortion variance σ, and the shift h (SS2110, variable initialization substep). Any values may be used as the initial values of these variables; for example, the following initial values can be used.

This method is based on the EM algorithm and involves iterative processing. Before the iteration starts, the count initialization unit 2111 sets the iteration count C to 1 (SS2111, count initialization substep). The synthesis unit 2112 calculates the first coefficient ψ_k,n and the second coefficient υ_k,n defined by the following equations for all 1 ≤ k ≤ K and 1 ≤ n ≤ N (SS2112, synthesis substep).

Here, Δ is a given positive integer, and f(·) and g(·) are functions defined by the following equations, respectively.

Multiplication, division, exponentiation, and function evaluation are applied element-wise to vectors. The coefficient calculation unit 2113 calculates the third coefficient ω_k,n, the fourth coefficient l_k,n, and the fifth coefficient e_k,n defined by the following equations for all 1 ≤ k ≤ K and 1 ≤ n ≤ N (SS2113, coefficient calculation substep).

The processing branch unit 2114 branches processing to the first update substep (SS2115) if the iteration count C is odd (SS2114 YES) and to the second update substep (SS2116) if it is even (SS2114 NO).

The first update unit 2115 updates the attenuation rate b and the distortion variance σ according to the following equations (SS2115, first update substep).

The second update unit 2116 updates the shift h according to the following equation (SS2116, second update substep).
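The alternating schedule of substeps SS2112 to SS2116 can be sketched as a small driver loop. The update rules themselves are not reproduced here, so they are passed in as callables. `e_step`, `update_b_sigma`, and `update_h` are hypothetical placeholders for the coefficient calculations and the two update substeps, and the loop counter follows the count-based rule described for substeps SS2117 and SS2118.

```python
def run_alternating_updates(state, e_step, update_b_sigma, update_h, c_max):
    """Alternate between two partial updates depending on the parity of the count."""
    count = 1                                     # SS2111: C = 1
    while True:
        stats = e_step(state)                     # SS2112, SS2113: coefficients
        if count % 2 == 1:                        # SS2114: C is odd
            state = update_b_sigma(state, stats)  # SS2115: update b and sigma
        else:                                     # SS2114: C is even
            state = update_h(state, stats)        # SS2116: update h
        count += 1                                # SS2117: C = C + 1
        if count > c_max:                         # SS2118: count-based test
            return state
```

Each pass therefore updates only one group of variables while the other is held fixed, and the coefficients are recomputed before every partial update.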

The count increment unit 2117 computes C = C + 1 to increase the iteration count by 1 (SS2117, count increment substep). The convergence determination unit 2118 determines whether the EM algorithm has converged; if it has not (SS2118 NO), processing returns to the synthesis substep (SS2112), and if it has (SS2118 YES), processing proceeds to the attenuation rate conversion substep (SS2119) (SS2118, convergence determination substep). Convergence can be judged, for example, by whether the iteration count exceeds a predetermined threshold C_max. However, the convergence condition is not limited to this, and convergence may instead be judged from, for example, the change in the attenuation rate between iterations. The attenuation rate conversion unit 2119 obtains the reverberation time T₆₀(i) from the attenuation rate b_i calculated above (from here on, the utterance index i is written explicitly for clarity). Specifically, with Q denoting a predetermined constant and avg(·) denoting the operation that averages the elements of a vector, T₆₀(i) is calculated by the following equation (SS2119, attenuation rate conversion substep).

<Frame selection unit included in the speaker adaptation device of the present invention>
As described above, the frame selection unit 11 of the first embodiment and the frame selection unit 21 of the second embodiment were disclosed as specific configuration examples of the frame selection unit included in the speaker adaptation device of the present invention. The frame selection unit 11 of the first embodiment takes the set Y of degraded features as input and outputs, as the set X of adaptation features, those degraded features in Y that have a high SN ratio. The frame selection unit 21 of the second embodiment takes the set Y of degraded features as input and outputs, as the set X of adaptation features, those degraded features in Y that have a short reverberation time. However, the frame selection unit included in the speaker adaptation device of the present invention is not limited to the configurations disclosed in the first and second embodiments. The speaker adaptation device of the present invention builds on the following observation: because speech signals are non-stationary and the environments that distort speech are diverse, the SN ratio varies greatly from one short-time frame to another, so there exist short-time frames with a high SN ratio, that is, frames that are almost clean. The underlying idea is to pick out, from the degraded features containing distortion, the short-time frames with little distortion (little degradation) and to use them to adapt to the speaker the correction acoustic model prepared in advance as an unspecified speaker (speaker-independent) model. Accordingly, the frame selection unit of the present invention need only be configured to extract degraded features with a small degree of degradation from the set Y of degraded features and to output the extracted set as the set of adaptation features; any parameter indicating the degree of degradation, other than the SN ratio and the reverberation time, may also be used.

<Speech recognition experiment>
The following describes the results of a speech recognition experiment in which the speaker adaptation device 2 of the second embodiment of the present invention was combined with the conventional speech recognition device 9 shown in FIG. 1 and applied to speech recorded in a reverberant environment. FIG. 14 is a block diagram showing the configuration in which the speaker adaptation device 2 is combined with the speech recognition device 9. As shown in FIG. 14, the parameters stored in the correction acoustic model storage unit 93 of the speech recognition device 9 can be read and written from outside. The speaker adaptation device 2 thus reads the set of pre-adaptation parameter values initially stored in the correction acoustic model storage unit 93 and then overwrites the correction acoustic model storage unit 93 with the set of post-adaptation parameter values it generates. The frame selection unit 21 of the speaker adaptation device is configured as described in the second embodiment. The speech recognition device 9 and the speaker adaptation device 2 were realized by having a computer execute a speech recognition program and a speaker adaptation program, respectively, each describing the procedure of the processing the device performs.

A speech recognition experiment was conducted using the device of FIG. 14. The experiment used the training, evaluation, and adaptation data sets of the 20,000-word Wall Street Journal database. The training data set was used to train the recognition acoustic model and the correction acoustic model before speaker adaptation. To simulate reverberant speech, each utterance in the evaluation data set was convolved with a previously measured impulse response before use. The evaluation data set consists of 8 speakers, and for each speaker the adaptation data set contains 50 utterances of personal adaptation data. The utterances in the adaptation data set were likewise convolved with impulse responses recorded in several different reverberant environments to create simulated reverberant speaker adaptation data.
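Simulating reverberant speech in this way is a linear convolution of the clean waveform with a room impulse response. The sketch below shows the operation on toy data. The function name is hypothetical, and the short sequences stand in for real waveforms and measured impulse responses, which are typically thousands of samples long and would normally be convolved with an FFT-based routine.

```python
def convolve(signal, impulse_response):
    """Full linear convolution; the output has len(signal) + len(impulse_response) - 1 samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h  # each input sample excites a scaled, delayed copy of the response
    return out

# Toy example: a direct path followed by a half-amplitude echo one sample later.
clean = [1.0, 2.0, 3.0]
room = [1.0, 0.5]
reverberant = convolve(clean, room)
```

The tail that extends past the end of the clean signal is the part of the reverberation that, in running speech, smears into the following frames.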

In the experiment, the word error rate was 92.14% when no feature correction was performed at all. When feature correction was performed without speaker adaptation, that is, using the pre-adaptation parameter values, the word error rate improved to 54.93%. When speaker adaptation was performed and feature correction used the post-adaptation parameter values, the word error rate improved further to 51.83%. These results demonstrate the effectiveness of the speaker adaptation of the correction acoustic model proposed in the present invention.
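To put these figures in perspective, the short calculation below derives the absolute and relative reductions from the three reported word error rates. The variable and function names are illustrative only.

```python
def relative_reduction(before, after):
    """Relative error-rate reduction, as a percentage of the `before` value."""
    return 100.0 * (before - after) / before

wer_no_correction = 92.14      # no feature correction
wer_before_adaptation = 54.93  # correction with pre-adaptation parameters
wer_after_adaptation = 51.83   # correction with post-adaptation parameters

gain_correction = relative_reduction(wer_no_correction, wer_before_adaptation)
gain_adaptation = relative_reduction(wer_before_adaptation, wer_after_adaptation)
absolute_gain = wer_before_adaptation - wer_after_adaptation
```

Feature correction alone removes about 40% of the errors in relative terms, and speaker adaptation of the correction acoustic model removes roughly a further 5.6% of the remaining errors (3.10 points absolute).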

The present invention has been described above with reference to specific embodiments, but it is not necessarily limited to the embodiments and examples described above. The present invention can be implemented in various forms within the scope of the technical idea already described. The various processes described above are not only executed in time series in the order described but may also be executed in parallel or individually, according to the processing capability of the device that executes them or as needed. It goes without saying that other modifications can be made as appropriate without departing from the spirit of the present invention.

The speaker adaptation device of the present invention described above is realized on a computer by, for example, loading a program that describes the processing of the functions each device should have into the recording unit of the computer 8 shown in FIG. 15 and operating the arithmetic processing unit 81, the output device 82, the input device 83, the storage device 84, and so on.

The program describing this processing can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in the storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program, for example, first stores the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time a program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and result acquisition.

The program in this embodiment includes information that is provided for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing). In this embodiment the device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.

Claims

A speaker adaptation device, wherein,
an acoustic feature of distorted speech including at least one utterance being referred to as a degraded feature,
an acoustic model used for correcting the influence of distortion being referred to as a correction acoustic model, and
a correction acoustic model trained using the speech of unspecified speakers being referred to as an unspecified speaker model,
the speaker adaptation device comprises:
a frame selection unit that receives as input a set of degraded features recorded in advance, extracts degraded features with a small degree of degradation from the set of degraded features, and outputs the set of extracted degraded features as a set of adaptation features; and
a parameter modification unit that receives as input the set of adaptation features and a set of pre-adaptation parameter values, which are the values of the parameters defining the unspecified speaker model, adapts the set of pre-adaptation parameter values to the speaker so as to match the adaptation features, and outputs a set of post-adaptation parameter values, which are the speaker-adapted parameter values,
wherein, with avg(·) denoting an operation that averages the elements of a vector, i denoting a natural number that indexes the utterances included in the distorted speech, b_i denoting the attenuation rate of the i-th utterance, and Q denoting a predetermined constant,
the frame selection unit
uses the input set of degraded features and the set of pre-adaptation parameter values to estimate the reverberation time T₆₀(i) of the degraded features of the i-th utterance as

and extracts, as the degraded features with a small degree of degradation, the degraded features whose estimated reverberation time is less than a predetermined threshold.

A speaker adaptation method performed by a speaker adaptation device, wherein,
an acoustic feature of distorted speech including at least one utterance being referred to as a degraded feature,
an acoustic model used for correcting the influence of distortion being referred to as a correction acoustic model, and
a correction acoustic model trained using the speech of unspecified speakers being referred to as an unspecified speaker model,
the speaker adaptation method comprises:
a frame selection step of receiving as input a set of degraded features recorded in advance, extracting degraded features with a small degree of degradation from the set of degraded features, and outputting the set of extracted degraded features as a set of adaptation features; and
a parameter modification step of receiving as input the set of adaptation features and a set of pre-adaptation parameter values, which are the values of the parameters defining the unspecified speaker model, adapting the set of pre-adaptation parameter values to the speaker so as to match the adaptation features, and outputting a set of post-adaptation parameter values, which are the speaker-adapted parameter values,
wherein, with avg(·) denoting an operation that averages the elements of a vector, i denoting a natural number that indexes the utterances included in the distorted speech, b_i denoting the attenuation rate of the i-th utterance, and Q denoting a predetermined constant,
the frame selection step
uses the input set of degraded features and the set of pre-adaptation parameter values to estimate the reverberation time T₆₀(i) of the degraded features of the i-th utterance as

and extracts, as the degraded features with a small degree of degradation, the degraded features whose estimated reverberation time is less than a predetermined threshold.

A program for causing a computer to function as the speaker adaptation device according to claim 1.