
JP2003177779A - Speaker learning method for speech recognition - Google Patents

Speaker learning method for speech recognition

Info

Publication number
JP2003177779A
Authority
JP
Japan
Prior art keywords
speaker
learning
recognition
learning method
utterance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2001378341A
Other languages
Japanese (ja)
Other versions
JP3876703B2 (en)
JP2003177779A5 (en)
Inventor
Yumi Wakita
由実 脇田
Kenji Mizutani
研治 水谷
Shinichi Yoshizawa
伸一 芳澤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd filed Critical Matsushita Electric Industrial Co Ltd
Priority to JP2001378341A priority Critical patent/JP3876703B2/en
Publication of JP2003177779A publication Critical patent/JP2003177779A/en
Publication of JP2003177779A5 publication Critical patent/JP2003177779A5/ja
Application granted granted Critical
Publication of JP3876703B2 publication Critical patent/JP3876703B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Abstract

(57) [Abstract]

[Problem] Conventional speaker adaptation and speaker registration learning for improving the performance of speakers with poor recognition accuracy have the following problem: either the amount of training speech becomes large and burdens the speaker, or, if the amount of speech is limited to lighten the burden, recognition performance does not necessarily improve for all utterances, and words whose recognition rate actually drops may appear.

[Solution] Using only a few utterances, it is estimated whether the recognition errors depend on the utterance content; speaker adaptation learning is performed when they do not, and speaker registration learning when they do. This provides a speaker learning method that can reliably improve the recognition rate with an amount of training speech that does not burden the speaker.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speaker learning method for speech recognition.

[0002]

[Prior Art] A conventional speaker learning method is described below. Conventional speaker-independent speech recognition systems build and use a standard acoustic model intended to cover as many unspecified speakers as possible. In practice, however, speakers' utterance characteristics vary widely, and it is difficult to train an acoustic model that guarantees high performance for every user. Conventionally, therefore, performance for all speakers is ensured by speaker adaptation: for a speaker who is recognized poorly, the acoustic model parameters are retrained using that speaker's own utterances, and an acoustic model adapted to the speaker is rebuilt. Speaker adaptation requires a large amount of training speech to capture the speaker's characteristics, which burdens the speaker, so various techniques have been devised to keep the number of training utterances to a minimum (for example, Japanese Patent No. 2037877). As another learning method, there is also a speaker registration method in which the acoustic model sequence corresponding to the recognition result of a misrecognized word is added to the pronunciation dictionary as a correct sequence, so that what was recognized as an incorrect sequence can thereafter be recognized as the correct one (Japanese Patent Laid-Open No. 8-171396).

[0003]

[Problems to Be Solved by the Invention] The conventional speaker adaptation method can, in principle, reliably improve recognition performance if sufficient training data is available. However, when the number of training utterances is limited out of consideration for the speaker's burden, as is done in almost all practical systems, the recognition rate may actually drop for some utterances not represented in the training data. The conventional speaker registration method, on the other hand, reliably improves the recognition rate for the registered utterances, but for a speaker whose speech is hard to recognize across many kinds of utterances, every hard-to-recognize utterance must be spoken during training, which makes the training burdensome.

[0004] An object of the present invention is to solve the problems of conventional speaker adaptation learning and speaker registration learning, and to provide a speaker learning method that reliably improves the recognition rate after training with an amount of training speech that does not burden the speaker.

[0005]

[Means for Solving the Problems] To solve the problems described above, the speaker learning method according to claims 1 to 5 comprises: means for retraining acoustic model parameters using a speaker's training speech to create an acoustic model adapted to the speaker; means for adding the acoustic model sequence corresponding to the recognition result of a misrecognized word to the pronunciation dictionary as a correct sequence; and means for determining whether ease of recognition depends on the utterance content.

[0006]

[Embodiments of the Invention] The speaker learning method according to claims 1 to 5 of the present invention is described below with reference to the drawings.

[0007] FIG. 1 is a block diagram of the speaker learning method according to claims 1 to 5 of the present invention.

[0008] The speaker learning function is set up so that each speaker invokes it when he or she feels the need to improve recognition performance for his or her own voice. First, the system prompts the user to utter a specific word, and the speaker's utterance of that word is input. The content of this utterance is the minimum needed to judge, for each speaker, how well the standard speech prepared in advance fits that speaker. For Japanese recognition, for example, a word that contains all five vowels, such as "maiku tesuto" (microphone test), is suitable. If the system performs word recognition, several words may be selected from the target vocabulary so that all five vowels are covered.

[0009] This utterance undergoes normal recognition in speech recognition process 1, and a recognition result and a recognition confidence score are computed in recognition score calculation process 2. For the recognition result, the phoneme or syllable sequence of the result is compared with the correct phoneme sequence; differing portions are marked as errors and matching portions as correct, and correctness is recorded for each phoneme of the correct sequence. The confidence score is, for example, an acoustic distance score between the correct phoneme or syllable sequence and the uttered result, computed for each phoneme or syllable. When the weighted cepstrum distance is used as the distance measure, the confidence of each phoneme may be calculated by Equation 1.
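
The bookkeeping described in this paragraph might be sketched as follows. The alignment is simplified to a position-wise comparison, and because Equation 1 itself is not reproduced in this text, a generic function of a per-phoneme acoustic distance stands in for the actual confidence formula; both simplifications are assumptions for illustration only.

```python
# Sketch of recognition score calculation (process 2): mark each phoneme of
# the correct sequence as right/wrong and attach a confidence score.
# The weighted-cepstrum-distance confidence of Equation 1 is not reproduced
# here; an exponential of a negative distance is used as a stand-in.
import math

def score_utterance(correct_phonemes, recognized_phonemes, distances):
    """Return a list of (phoneme, is_correct, confidence) tuples.

    distances: per-phoneme acoustic distance between the uttered segment and
    the standard model of the correct phoneme (assumed to be precomputed).
    """
    results = []
    for i, ph in enumerate(correct_phonemes):
        rec = recognized_phonemes[i] if i < len(recognized_phonemes) else None
        is_correct = (rec == ph)
        confidence = math.exp(-distances[i])   # stand-in for Equation 1
        results.append((ph, is_correct, confidence))
    return results

# Example with the "menu" utterance from the description below.
print(score_utterance(["m", "e", "ny", "u", "u"],
                      ["d", "e", "ny", "u", "u"],
                      [1.8, 0.3, 0.2, 0.1, 0.1]))
```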

[0010]

[Equation 1: per-phoneme confidence score based on the weighted cepstrum distance (not reproduced here)]

[0011] In learning method decision process 3, the phonemes or syllables whose confidence score is at or below a threshold, or which are misrecognized even though the score is above the threshold (called adaptation-candidate phonemes or syllables), are identified, and their proportion among all phonemes or syllables in the utterance is computed. If this proportion is large, it is inferred that the speaker's utterance characteristics do not fit the standard speech regardless of utterance content, and that the entire standard model needs to be trained to fit the speaker. If the proportion is small, the misrecognition depends on the utterance content: the speaker's characteristics fit the standard speech overall, and training is needed only for particular utterances. Therefore, speaker adaptation learning is selected when the proportion is at or above a fixed value, and speaker registration learning is selected when it is below that value.

[0012] When speaker adaptation learning is selected, speaker adaptation process 4 prompts the user for the minimum additional utterances needed for adaptation. As the speaker adaptation method, if for example the VFS method described in Japanese Patent Laid-Open No. 5-53599 is used, the standard acoustic model is matched against the training input speech parameters, a fuzzy membership function is obtained from the relationship between the corresponding parameters, and, using the obtained function as a weight, the parameters of the standard acoustic model are updated so that the standard speech moves closer to the training input speech.
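
In essence, the adaptation update described here pulls each standard-model parameter toward the matched training-speech parameter under a membership-style weight. The sketch below uses a plain per-state scalar weight in place of the fuzzy membership function of the cited VFS method, so it is an assumed simplification rather than that method itself.

```python
# Sketch of the speaker adaptation update (process 4): move standard acoustic
# model mean vectors toward the matched training-speech parameters, weighted
# by a membership-like weight w in [0, 1]. The actual fuzzy membership
# function of JP-A-5-53599 is not reproduced; w is a placeholder.
import numpy as np

def adapt_means(standard_means, matched_speech_params, weights):
    """standard_means, matched_speech_params: (n_states, dim) arrays."""
    w = np.asarray(weights)[:, None]           # one weight per model state
    return (1.0 - w) * standard_means + w * matched_speech_params

standard = np.array([[0.0, 1.0], [2.0, 2.0]])
observed = np.array([[0.4, 1.2], [2.5, 1.5]])
print(adapt_means(standard, observed, [0.5, 0.2]))
```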

[0013] When speaker registration learning is selected, speaker registration process 5 prompts the user to utter only the words containing the adaptation-candidate phonemes or syllables computed in the learning decision process, and the phoneme or syllable recognition result sequence for that utterance is added to pronunciation dictionary 7 for the phoneme sequences of the words containing the phoneme sequence corresponding to the adaptation candidates. For example, suppose the word "menu" (メニュー) is misrecognized: the user is prompted to utter only this word, and its recognition result is "denu" (デニュー). If phoneme models are used as the acoustic model, the correct phoneme model sequence for "menu" is /m e ny u u/, and the recognized phoneme sequence is /d e ny u u/. For this speaker, the phoneme /m/ at the beginning of a word and followed by /e/ tends to be misrecognized as /d/. Therefore, for the recognition target words, a phoneme sequence is added to the pronunciation dictionary so that a word-initial /m/ followed by /e/ is still recognized as /m/ even when it is misrecognized as /d/. In this example, /d e ny u u/ is added to the entry that was originally "menu /m e ny u u/", and the dictionary entry becomes "menu /m e ny u u/ or /d e ny u u/". As a result, even if this speaker's "menu" is recognized as /d e ny u u/, the word "menu" is ultimately recognized.
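
Dictionary-side registration as in this example reduces to attaching an alternative phoneme sequence to an existing entry. Below is a minimal sketch in which the pronunciation dictionary is represented as a word-to-sequences mapping; that representation and the helper name are assumptions for illustration.

```python
# Sketch of speaker registration (process 5): add the recognized phoneme
# sequence of a misrecognized word to the pronunciation dictionary as an
# additional valid pronunciation of that word.
def register_pronunciation(dictionary, word, recognized_sequence):
    """dictionary: dict mapping word -> list of phoneme sequences."""
    variants = dictionary.setdefault(word, [])
    if recognized_sequence not in variants:
        variants.append(recognized_sequence)
    return dictionary

# "menu" example from the description: /m e ny u u/ misrecognized as /d e ny u u/.
lexicon = {"menu": [["m", "e", "ny", "u", "u"]]}
register_pronunciation(lexicon, "menu", ["d", "e", "ny", "u", "u"])
print(lexicon["menu"])   # both sequences are now accepted for "menu"
```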

[0014] As described above, whether the speaker's utterances are misrecognized independently of the utterance content is estimated; speaker adaptation learning is performed when the errors do not depend on the utterance content, and speaker registration learning when they do. Performing speaker registration learning instead of speaker adaptation learning solves the problem of conventional speaker adaptation learning in which the recognition rate drops despite many training utterances made for adaptation. Performing speaker adaptation learning instead of speaker registration learning solves the problem of conventional speaker registration learning in which training was impossible without uttering many words.

[0015]

[Effects of the Invention] As described in detail above, the speaker learning method according to claim 1 of the present invention selects between speaker adaptation learning and speaker registration learning according to each speaker's ease of recognition and the degree to which it depends on the utterance content, and prompts the speaker to perform one of the two. By automatically selecting speaker registration learning instead of speaker adaptation learning, it solves the problem of conventional speaker adaptation learning in which the recognition rate drops despite many training utterances made for adaptation. By automatically selecting speaker adaptation learning instead of speaker registration learning, it solves the problem of conventional speaker registration learning in which training was impossible without uttering many words. It therefore provides a speaker learning method that can reliably improve the recognition rate with an amount of training that does not burden the speaker.

[0016] As described in detail above, in the speaker learning method according to claim 2 of the present invention, the means for determining whether ease of recognition depends on the utterance content computes a recognition score for the minimum training utterance from which the dependence can be judged, and decides whether the dependence exists from the magnitude of that score. By automatically selecting speaker registration learning instead of speaker adaptation learning, it solves the problem of conventional speaker adaptation learning in which the recognition rate drops despite many training utterances made for adaptation, and by automatically selecting speaker adaptation learning instead of speaker registration learning, it solves the problem of conventional speaker registration learning in which training was impossible without uttering many words. It therefore provides a speaker learning method that can reliably improve the recognition rate with an amount of training that does not burden the speaker.

[0017] As described in detail above, the speaker learning method according to claim 3 of the present invention performs speaker registration learning when ease of recognition is judged to depend on the utterance content, and speaker adaptation learning when it is judged not to depend on it. By automatically selecting speaker registration learning instead of speaker adaptation learning, it solves the problem of conventional speaker adaptation learning in which the recognition rate drops despite many training utterances made for adaptation, and by automatically selecting speaker adaptation learning instead of speaker registration learning, it solves the problem of conventional speaker registration learning in which training was impossible without uttering many words. It therefore provides a speaker learning method that can reliably improve the recognition rate with an amount of training that does not burden the speaker.

[0018] As described in detail above, in the speaker learning method according to claim 4 of the present invention, the recognition score is calculated from the correctness of the recognition result, the distance value to the standard speech, or the reliability of that distance value, each used alone or in combination. By automatically selecting speaker registration learning instead of speaker adaptation learning, it solves the problem of conventional speaker adaptation learning in which the recognition rate drops despite many training utterances made for adaptation, and by automatically selecting speaker adaptation learning instead of speaker registration learning, it solves the problem of conventional speaker registration learning in which training was impossible without uttering many words. It therefore provides a speaker learning method that can reliably improve the recognition rate with an amount of training that does not burden the speaker.

[0019] As described in detail above, in the speaker learning method according to claim 5 of the present invention, the recognition score is likewise calculated from the correctness of the recognition result, the distance value to the standard speech, or the reliability of that distance value, each used alone or in combination. By automatically selecting speaker registration learning instead of speaker adaptation learning, it solves the problem of conventional speaker adaptation learning in which the recognition rate drops despite many training utterances made for adaptation, and by automatically selecting speaker adaptation learning instead of speaker registration learning, it solves the problem of conventional speaker registration learning in which training was impossible without uttering many words. It therefore provides a speaker learning method that can reliably improve the recognition rate with an amount of training that does not burden the speaker.

[Brief Description of the Drawings]

[FIG. 1] A block diagram of the speaker learning method according to one embodiment of the present invention.

[Explanation of Symbols]

1 Speech recognition, 2 Recognition score calculation, 3 Learning method decision, 4 Speaker adaptation, 5 Speaker registration, 6 Acoustic model, 7 Pronunciation dictionary, 8 Recognition score buffer

Continuation of front page: (72) Inventor: Shinichi Yoshizawa, 1006 Oaza Kadoma, Kadoma-shi, Osaka, within Matsushita Electric Industrial Co., Ltd. F-terms (reference): 5D015 AA02 AA03 GG01 GG04 GG05 GG06

Claims (5)

[Claims]

[Claim 1] A speaker learning method comprising: means for retraining acoustic model parameters using a speaker's training speech to create an acoustic model adapted to the speaker (hereinafter referred to as speaker adaptation learning); means for adding the acoustic model sequence corresponding to the recognition result of a misrecognized word to a pronunciation dictionary as a correct sequence (hereinafter referred to as speaker registration learning); and means for determining whether ease of recognition depends on the utterance content; wherein the method selects between speaker adaptation learning and speaker registration learning according to each speaker's ease of recognition and the degree to which it depends on the utterance content, and prompts the speaker to perform the selected learning.

[Claim 2] The speaker learning method according to claim 1, wherein the means for determining whether ease of recognition depends on the utterance content computes a recognition score for the minimum training utterance from which the dependence can be judged, and decides whether the dependence exists from the magnitude of the score.

[Claim 3] The speaker learning method according to claim 1, wherein speaker registration learning is performed when ease of recognition is judged to depend on the utterance content, and speaker adaptation learning is performed when it is judged not to depend on it.

[Claim 4] The speaker learning method according to claim 2, wherein the recognition score is calculated from the correctness of the recognition result, the distance value to the standard speech, or the reliability of that distance value, each used alone or in combination.

[Claim 5] The speaker learning method according to claim 2, wherein the recognition score is calculated from the correctness of the recognition result, the distance value to the standard speech, or the reliability of that distance value, each used alone or in combination.
JP2001378341A 2001-12-12 2001-12-12 Speaker learning apparatus and method for speech recognition Expired - Fee Related JP3876703B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2001378341A JP3876703B2 (en) 2001-12-12 2001-12-12 Speaker learning apparatus and method for speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2001378341A JP3876703B2 (en) 2001-12-12 2001-12-12 Speaker learning apparatus and method for speech recognition

Publications (3)

Publication Number Publication Date
JP2003177779A true JP2003177779A (en) 2003-06-27
JP2003177779A5 JP2003177779A5 (en) 2005-07-14
JP3876703B2 JP3876703B2 (en) 2007-02-07

Family

ID=19186094

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2001378341A Expired - Fee Related JP3876703B2 (en) 2001-12-12 2001-12-12 Speaker learning apparatus and method for speech recognition

Country Status (1)

Country Link
JP (1) JP3876703B2 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009532744A (en) * 2006-04-03 2009-09-10 ヴォコレクト・インコーポレーテッド Method and system for fitting a model to a speech recognition system
US8290773B2 (en) 2008-12-26 2012-10-16 Fujitsu Limited Information processing apparatus, method and recording medium for generating acoustic model
US8374870B2 (en) 2005-02-04 2013-02-12 Vocollect, Inc. Methods and systems for assessing and improving the performance of a speech recognition system
JP2013083798A (en) * 2011-10-11 2013-05-09 Nippon Telegr & Teleph Corp <Ntt> Sound model adaptation device, sound model adaptation method, and program
US8612235B2 (en) 2005-02-04 2013-12-17 Vocollect, Inc. Method and system for considering information about an expected response when performing speech recognition
US8756059B2 (en) 2005-02-04 2014-06-17 Vocollect, Inc. Method and system for considering information about an expected response when performing speech recognition
US8868421B2 (en) 2005-02-04 2014-10-21 Vocollect, Inc. Methods and systems for identifying errors in a speech recognition system
US8914290B2 (en) 2011-05-20 2014-12-16 Vocollect, Inc. Systems and methods for dynamically improving user intelligibility of synthesized speech in a work environment
US9978395B2 (en) 2013-03-15 2018-05-22 Vocollect, Inc. Method and system for mitigating delay in receiving audio stream during production of sound from audio stream
US11837253B2 (en) 2016-07-27 2023-12-05 Vocollect, Inc. Distinguishing user speech from background speech in speech-dense environments

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9202458B2 (en) 2005-02-04 2015-12-01 Vocollect, Inc. Methods and systems for adapting a model for a speech recognition system
US9928829B2 (en) 2005-02-04 2018-03-27 Vocollect, Inc. Methods and systems for identifying errors in a speech recognition system
US8374870B2 (en) 2005-02-04 2013-02-12 Vocollect, Inc. Methods and systems for assessing and improving the performance of a speech recognition system
US10068566B2 (en) 2005-02-04 2018-09-04 Vocollect, Inc. Method and system for considering information about an expected response when performing speech recognition
US8612235B2 (en) 2005-02-04 2013-12-17 Vocollect, Inc. Method and system for considering information about an expected response when performing speech recognition
US8756059B2 (en) 2005-02-04 2014-06-17 Vocollect, Inc. Method and system for considering information about an expected response when performing speech recognition
US8868421B2 (en) 2005-02-04 2014-10-21 Vocollect, Inc. Methods and systems for identifying errors in a speech recognition system
JP2009532744A (en) * 2006-04-03 2009-09-10 ヴォコレクト・インコーポレーテッド Method and system for fitting a model to a speech recognition system
US8290773B2 (en) 2008-12-26 2012-10-16 Fujitsu Limited Information processing apparatus, method and recording medium for generating acoustic model
US9697818B2 (en) 2011-05-20 2017-07-04 Vocollect, Inc. Systems and methods for dynamically improving user intelligibility of synthesized speech in a work environment
US8914290B2 (en) 2011-05-20 2014-12-16 Vocollect, Inc. Systems and methods for dynamically improving user intelligibility of synthesized speech in a work environment
US10685643B2 (en) 2011-05-20 2020-06-16 Vocollect, Inc. Systems and methods for dynamically improving user intelligibility of synthesized speech in a work environment
US11810545B2 (en) 2011-05-20 2023-11-07 Vocollect, Inc. Systems and methods for dynamically improving user intelligibility of synthesized speech in a work environment
US11817078B2 (en) 2011-05-20 2023-11-14 Vocollect, Inc. Systems and methods for dynamically improving user intelligibility of synthesized speech in a work environment
JP2013083798A (en) * 2011-10-11 2013-05-09 Nippon Telegr & Teleph Corp <Ntt> Sound model adaptation device, sound model adaptation method, and program
US9978395B2 (en) 2013-03-15 2018-05-22 Vocollect, Inc. Method and system for mitigating delay in receiving audio stream during production of sound from audio stream
US11837253B2 (en) 2016-07-27 2023-12-05 Vocollect, Inc. Distinguishing user speech from background speech in speech-dense environments

Also Published As

Publication number Publication date
JP3876703B2 (en) 2007-02-07

Similar Documents

Publication Publication Date Title
EP1557822B1 (en) Automatic speech recognition adaptation using user corrections
US7013276B2 (en) Method of assessing degree of acoustic confusability, and system therefor
US7401017B2 (en) Adaptive multi-pass speech recognition system
EP0907949B1 (en) Method and system for dynamically adjusted training for speech recognition
KR100826875B1 (en) On-line speaker recognition method and apparatus therefor
KR100305455B1 (en) Apparatus and method for automatically generating punctuation marks in continuous speech recognition
JP6654611B2 (en) Growth type dialogue device
US8886532B2 (en) Leveraging interaction context to improve recognition confidence scores
EP2048655A1 (en) Context sensitive multi-stage speech recognition
JPH0968994A (en) Method of recognizing words by pattern matching and apparatus for implementing the method
JP3876703B2 (en) Speaker learning apparatus and method for speech recognition
JP2004333543A (en) Voice interaction system and voice interaction method
JP2004325635A (en) Apparatus, method, and program for speech processing, and program recording medium
JP4293340B2 (en) Dialogue understanding device
EP1734509A1 (en) Method and system for speech recognition
JPH0667698A (en) Voice recognizer
JP4749990B2 (en) Voice recognition device
JP2001175276A (en) Speech recognizing device and recording medium
JP2001013988A (en) Method and device for voice recognition
JP4604424B2 (en) Speech recognition apparatus and method, and program
JPH05323990A (en) Speaker recognition method
JP3357752B2 (en) Pattern matching device
JPH0772899A (en) Voice recognizer
JPH11259086A (en) Voice recognition method and voice recognition device
JPH11338492A (en) Speaker recognition device

Legal Events

Date Code Title Description
A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20041116

A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20041116

RD01 Notification of change of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7421

Effective date: 20050704

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20060801

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20060828

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20061010

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20061023

R151 Written notification of patent or utility model registration

Ref document number: 3876703

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R151

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20091110

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20101110

Year of fee payment: 4

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20111110

Year of fee payment: 5

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20121110

Year of fee payment: 6

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20131110

Year of fee payment: 7

S111 Request for change of ownership or part of ownership

Free format text: JAPANESE INTERMEDIATE CODE: R313113

S533 Written request for registration of change of name

Free format text: JAPANESE INTERMEDIATE CODE: R313533

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

LAPS Cancellation because of no payment of annual fees