JP2000305589A

JP2000305589A - Adaptive type voice recognition device, voice processing device and pet toy

Info

Publication number: JP2000305589A
Application number: JP11108751A
Authority: JP
Inventors: Takayuki Hiekata; 孝之稗方; Tetsuya Takahashi; 哲也高橋; Hiroshi Hashimoto; 裕志橋本; Yoshiro Nishimoto; 善郎西元
Original assignee: Kobe Steel Ltd
Current assignee: Kobe Steel Ltd
Priority date: 1999-04-16
Filing date: 1999-04-16
Publication date: 2000-11-02

Abstract

PROBLEM TO BE SOLVED: To suppress the adverse effect that meaningless or incorrect intonation or pronunciation has on an updated speech model, by using an update parameter to control the degree of update when the speech model is updated based on the feature quantity of input speech. SOLUTION: When the speech model is updated based on the feature quantity of the input speech, the degree to which the model is updated is controlled by an update parameter. In a speech recognition device A1, when a feature extraction unit 1 extracts the feature quantity, an evaluation value calculation unit 3 computes the similarity between the feature quantity and the speech model. In addition, an update parameter determination unit 6 determines the update parameter for the input speech according to conditions considered favorable for learning, so that unfavorable learning is avoided. For example, the unit 6 determines the update parameter based on the similarity.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to an adaptive speech recognition device, a speech processing device, and a pet toy. More specifically, it relates to an adaptive speech recognition device, a speech processing device, and a pet toy in which the speech model used for speech recognition is updated according to input speech in order to adapt to a specific speaker.

[0002]

[Description of the Related Art] In recent years, speech recognition technology has become increasingly important as devices have become smaller and more multifunctional and as processing has grown more sophisticated. Various speech recognition methods exist, and speaker-independent systems, which do not assume a particular speaker and use methods such as dynamic programming matching (DP matching) and hidden Markov models (HMMs), are being actively researched and developed. This is because a speaker-independent system performs recognition using speech models created in advance from a large amount of linguistic information, which offers advantages such as a small burden on the user. However, in a speaker-independent system the recognition rate can drop considerably for some speakers. Depending on the intended use, it is therefore necessary, even in a speaker-independent system, to adapt to a specific speaker and thereby improve the recognition rate for that speaker. Speaker adaptation in a speaker-independent system is performed by updating the previously created speech models according to the utterances of the speaker. Techniques for such speaker adaptation are described in, for example, Japanese Patent Laid-Open No. 7-230295 (hereinafter, Reference 1) and Japanese Patent Laid-Open No. 9-81183 (hereinafter, Reference 2).

The speech recognition technique described in Reference 1 concerns speaker adaptation using mixture continuous-density HMMs. A word HMM is held for each recognition candidate word. When an input pattern corresponding to input speech is created, the input pattern is recognized using the word HMMs and the recognition result is output. For adaptation, the notation of the recognized word is referenced and an initial word HMM for adaptation is prepared. A likelihood calculation based on this initial word HMM and the input pattern is performed for one or more input patterns, and the adapted mean vector is obtained from the result. An adapted HMM is generated from the adapted mean vector and replaces the original HMM, whereby speaker adaptation is performed.

The speech recognition technique described in Reference 2 uses continuous-density HMMs defined by three parameters: state transition probability, mean vector, and variance. The mean vector is calculated from input training speech, an HMM that approximates the input training speech is selected from a registered dictionary as the initial model, and the mean vector in the selected HMM is replaced with the mean vector calculated for the training speech. In this way, speaker adaptation is achieved with the user uttering the training speech only a very small number of times.

[0003]

[Problems to Be Solved by the Invention] In the prior art described in References 1 and 2, no consideration is given to what kind of input speech is received, so speaker adaptation is carried out even when an utterance is made by mistake or unrelated speech is mixed in. As a result, an attempt at speaker adaptation may instead degrade the speech model. In addition, pet toys that imitate pets such as dogs and cats, and that respond to recognized speech by, for example, uttering an animal cry, have attracted attention in recent years. For such applications it is preferable that the degree of adaptation be adjustable, so as to imitate the way a pet gradually becomes familiar with its owner. To solve these problems of the prior art, the present invention improves the adaptive speech recognition device, the speech processing device, and the pet toy. Its object is to provide an adaptive speech recognition device, a speech processing device, and a pet toy that can perform appropriate speaker adaptation by varying the degree of speech model update according to the input speech, for example by setting the degree of update low when the similarity between the feature quantity of the input speech and the speech model is low.

[0004]

[Means for Solving the Problems] To achieve the above object, the invention according to claim 1 is configured as an adaptive speech recognition device comprising: feature quantity extraction means for extracting a feature quantity of input speech; speech model storage means for storing speech models corresponding to recognition target words; similarity calculation means for calculating the similarity between the feature quantity extracted by the feature quantity extraction means and the speech models stored in the speech model storage means; recognition target word selection and determination means for selecting and determining the recognition target word corresponding to the input speech based on the similarity calculated by the similarity calculation means; speech model update means for updating the speech models stored in the speech model storage means based on the feature quantity extracted by the feature quantity extraction means; and update parameter determination means for determining an update parameter that controls the degree to which the speech model update means updates the speech model.
According to the adaptive speech recognition device of claim 1, when the speech model is updated based on the feature quantity of the input speech, the degree of update is controlled by the update parameter, so the adverse effect of meaningless or incorrect utterances on the updated speech model can be suppressed.
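The role of the update parameter can be illustrated with a minimal sketch. The text at this point does not give an update formula, so the linear interpolation below, the `SpeechModel` class, and the name `update_model` are all assumptions chosen for illustration; the only idea taken from the source is that a single parameter scales how far the stored model moves toward the input feature.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechModel:
    """Hypothetical stand-in for a stored speech model: one mean vector per word."""
    word: str
    mean: List[float]


def update_model(model: SpeechModel, feature: List[float], update_param: float) -> None:
    """Move the model mean toward the input feature by a fraction `update_param`.

    update_param = 0.0 leaves the model untouched; 1.0 replaces the mean with
    the feature. The interpolation rule itself is an assumption, not the
    formula of the patent.
    """
    if not 0.0 <= update_param <= 1.0:
        raise ValueError("update_param must lie in [0, 1]")
    if len(feature) != len(model.mean):
        raise ValueError("feature and model mean must have the same length")
    model.mean = [
        (1.0 - update_param) * m + update_param * f
        for m, f in zip(model.mean, feature)
    ]
```

With a small `update_param`, a single stray utterance shifts the model only slightly, which is the protective effect the paragraph above attributes to the update parameter.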

[0005] The invention according to claim 2 is the adaptive speech recognition device of claim 1, wherein the update parameter determination means determines the update parameter based on the similarity calculated by the similarity calculation means. According to the adaptive speech recognition device of claim 2, because the update parameter is determined based on the similarity, the adverse effect on the updated speech model can be suppressed by, for example, determining the update parameter so that the degree of update increases as the similarity increases.

The invention according to claim 3 is the adaptive speech recognition device of claim 1 or 2, further comprising utterance request determination means for determining and outputting a message that prompts the utterance of a specific recognition target word, wherein, when input speech is detected within a predetermined time after the message determined by the utterance request determination means is output, the update parameter determination means determines the update parameter for that input speech so that the degree of speech model update increases. According to the adaptive speech recognition device of claim 3, a message prompting the utterance of a specific recognition target word is determined, and when input speech is detected within the predetermined time after the message is output, the update parameter for that input speech is determined so that the degree of speech model update increases. The possibility that an update is performed in response to a meaningless or incorrect utterance can therefore be reduced.
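The two rules in this paragraph (a larger update for higher similarity, and a larger update for speech that arrives shortly after a prompt) can be sketched as follows. The linear mapping, the 5-second window, the gain of 2, and the function names are hypothetical values picked for illustration; the source only states the direction of each effect.

```python
from typing import Optional


def update_param_from_similarity(similarity: float, max_param: float = 0.5) -> float:
    """Map a similarity in [0, 1] to an update parameter that grows with it.

    The linear mapping and the ceiling `max_param` are assumptions.
    """
    s = min(max(similarity, 0.0), 1.0)  # clamp out-of-range scores
    return max_param * s


def boost_if_prompted(
    param: float,
    seconds_since_prompt: Optional[float],
    window_s: float = 5.0,
    gain: float = 2.0,
) -> float:
    """Raise the update parameter when speech follows a prompt within `window_s`.

    `seconds_since_prompt` is None when no prompt message has been output.
    """
    if seconds_since_prompt is not None and 0.0 <= seconds_since_prompt <= window_s:
        return min(1.0, param * gain)
    return param
```

For example, an utterance with similarity 0.8 heard two seconds after a prompt would receive `boost_if_prompted(update_param_from_similarity(0.8), 2.0)`, while the same utterance with no recent prompt keeps the unboosted value.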

[0006] The invention according to claim 4 is the adaptive speech recognition device of any one of claims 1 to 3, further comprising mode selection means for selecting either a normal mode or a learning mode in which the user is notified of the utterance content in advance, wherein the update parameter determination means determines the update parameter according to the mode selected by the mode selection means. The invention according to claim 5 is the adaptive speech recognition device of claim 4, wherein the update parameter determination means determines the update parameter so that the degree of speech model update is higher when the learning mode is selected by the mode selection means than when the normal mode is selected. According to the adaptive speech recognition device of claim 4 or 5, the determination of the update parameter differs between the learning mode, in which the user is notified of the utterance content in advance, and the normal mode. In the learning mode the degree of update is given priority, while in the normal mode the burden on the user is reduced and the adverse effect of meaningless utterances and the like is suppressed. As a result, efficient and effective learning can be performed.

The invention according to claim 6 is the adaptive speech recognition device of any one of claims 1 to 5, further comprising a recognition result history storage unit that stores a history of the recognition target words already selected and determined by the recognition target word selection and determination means, together with their similarities, wherein the update parameter determination means determines the update parameter for those recognition target words stored in the recognition result history storage unit whose similarity is low so that the degree of speech model update increases. According to the adaptive speech recognition device of claim 6, the update parameter for recognition target words with low similarity among those already recognized is determined so that the degree of speech model update increases, so bias in the degree of adaptation among recognition target words can be reduced.

The invention according to claim 7 is the adaptive speech recognition device of claim 3, further comprising a recognition result history storage unit that stores a history of the recognition target words already selected and determined by the recognition target word selection and determination means, together with their similarities, wherein the utterance content request unit preferentially determines, from among the recognition target words stored in the recognition result history storage unit, a message prompting the utterance of a recognition target word according to its similarity. According to the adaptive speech recognition device of claim 7, a message prompting the utterance of a recognition target word is preferentially determined according to the similarity of the words already recognized, so bias in the degree of adaptation among recognition target words can be reduced.
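A rough sketch of the mode-dependent parameter and the recognition history follows. The class `RecognitionHistory`, the use of the mean similarity per word, and the learning-mode gain of 2 are assumptions; the source states only that the learning mode should update more strongly and that words with low similarity in the history should be favored.

```python
from typing import Dict, List


class RecognitionHistory:
    """Hypothetical recognition result history: similarities recorded per word."""

    def __init__(self) -> None:
        self._scores: Dict[str, List[float]] = {}

    def record(self, word: str, similarity: float) -> None:
        self._scores.setdefault(word, []).append(similarity)

    def mean_similarity(self, word: str) -> float:
        scores = self._scores[word]
        return sum(scores) / len(scores)

    def least_adapted_word(self) -> str:
        """Return the word with the lowest mean similarity (a candidate to prompt).

        Raises ValueError if nothing has been recorded yet.
        """
        return min(self._scores, key=self.mean_similarity)


def update_param_for_mode(base_param: float, learning_mode: bool, gain: float = 2.0) -> float:
    """Use a larger update parameter in learning mode than in normal mode."""
    return min(1.0, base_param * gain) if learning_mode else base_param
```

Prompting for `least_adapted_word()` is one simple way to even out adaptation across words; a real device might instead weight the choice by similarity rather than always picking the minimum.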

[0007] The invention according to claim 8 is the adaptive speech recognition device of any one of claims 1 to 7, wherein, when the speech models include first models registered in advance and second models registered independently by the user, the update parameter determination means determines the update parameter for speech models belonging to the second models so that their degree of update is higher than that of speech models belonging to the first models. According to the adaptive speech recognition device of claim 8, speaker adaptation can be applied preferentially to the speech models registered independently by the user.

The invention according to claim 9 is the adaptive speech recognition device of claim 3 or 7, wherein, when the speech models include first models registered in advance and second models registered independently by the user, the utterance content request unit determines messages prompting the utterance of recognition target words corresponding to speech models belonging to the second models more frequently than for speech models belonging to the first models. According to the adaptive speech recognition device of claim 9, messages prompting the utterance of recognition target words corresponding to the user-registered speech models are issued preferentially, and as a result speaker adaptation is applied preferentially to those user-registered speech models.

The invention according to claim 10 is the adaptive speech recognition device of any one of claims 1 to 9, wherein the update parameter determination means weights the update parameter based on the length of the input speech. According to the adaptive speech recognition device of claim 10, whether the input speech is suitable for an update is estimated from its length, so the adverse effect of meaningless or incorrect utterances on the updated speech model can be suppressed.
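The weighting by utterance length and the preference for user-registered models might look like the sketch below. The source does not say how length maps to a weight, so the hard duration window (0.3 s to 3.0 s), the gain of 1.5 for user-registered models, and both function names are purely illustrative assumptions.

```python
def length_weight(duration_s: float, min_s: float = 0.3, max_s: float = 3.0) -> float:
    """Return 1.0 for utterances of plausible length, 0.0 otherwise.

    A hard window is the simplest possible weighting; a smooth ramp would
    also fit the description.
    """
    return 1.0 if min_s <= duration_s <= max_s else 0.0


def weighted_update_param(
    base_param: float,
    duration_s: float,
    user_registered: bool,
    user_gain: float = 1.5,
) -> float:
    """Combine the length weight with a boost for user-registered models."""
    param = base_param * length_weight(duration_s)
    if user_registered:
        param *= user_gain
    return min(1.0, param)
```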

[0008] The invention according to claim 11 is the adaptive speech recognition device of any one of claims 1 to 10, wherein, when the similarities corresponding to consecutive input utterances are at or above a predetermined threshold, the update parameter determination means sets the update parameter to a larger value than when they are at or below the predetermined threshold. The invention according to claim 12 is the adaptive speech recognition device of claim 11, wherein the consecutive input utterances are a combination of a recognition target word corresponding to a first model registered in advance among the speech models and a recognition target word corresponding to a second model registered independently by the user. According to the adaptive speech recognition device of claim 11 or 12, the speech model is updated when the similarities corresponding to consecutive input utterances are at or above the predetermined threshold, so the possibility that the speech model is updated in response to a meaningless or incorrect utterance can be reduced.

The invention according to claim 13 is the adaptive speech recognition device of any one of claims 1 to 12, wherein the speech model storage means includes non-volatile, read-only first storage means and non-volatile, rewritable second storage means, the speech models registered in advance are stored in the first storage means, and the updated portion of a speech model updated by the speech model update means is stored in the second storage means. The invention according to claim 14 is the adaptive speech recognition device of claim 13, wherein, in the case where the speech model updated by the speech model update means is held in volatile storage means, when the power is turned off, the updated portion of the speech model is generated by subtracting the speech model stored in the first storage means from the updated speech model held in the volatile storage means, and is stored in the second storage means. The invention according to claim 15 is the adaptive speech recognition device of claim 14, wherein, when the power is turned on, the speech model stored in the first storage means and the updated portion of the speech model stored in the second storage means are added together and transferred to the volatile storage means. The invention according to claim 16 is the adaptive speech recognition device of claim 15, wherein the first storage means is a ROM and the second storage means is a flash memory that is erased block by block. According to the adaptive speech recognition device of any one of claims 13 to 16, even if the power is cut unexpectedly, at least the initial speech models remain in the non-volatile, read-only first storage means, so complete loss of the speech models can be prevented.
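The subtract-on-power-off and add-on-power-on scheme described above reduces to two element-wise operations. The sketch below treats each model as a flat list of integer parameters; the function names and that flat representation are assumptions made for illustration.

```python
from typing import List


def delta_on_power_off(ram_model: List[int], rom_model: List[int]) -> List[int]:
    """Updated portion to store in flash: adapted model minus the ROM baseline."""
    if len(ram_model) != len(rom_model):
        raise ValueError("models must have the same number of parameters")
    return [r - b for r, b in zip(ram_model, rom_model)]


def restore_on_power_on(rom_model: List[int], flash_delta: List[int]) -> List[int]:
    """Working model to load into RAM: ROM baseline plus the stored delta."""
    if len(rom_model) != len(flash_delta):
        raise ValueError("baseline and delta must have the same length")
    return [b + d for b, d in zip(rom_model, flash_delta)]
```

Because the ROM baseline is never rewritten, a lost or corrupted delta costs only the adaptation, not the factory models, which is the benefit the paragraph above attributes to this arrangement.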

[0009] The invention according to claim 17 is the adaptive speech recognition device of claim 16, wherein, when two or more blocks of the flash memory are used: the updated portion of the speech model, error-check data, and the number of writes since initialization of the flash memory are stored in a block that is in the erased state among the two or more blocks; starting from the block with the largest number of writes since initialization, the error-check data is used to determine whether the updated portion of the speech model is intact; among the updated portions determined to be intact, the one in the block with the largest number of writes since initialization is transferred to the volatile storage means; and the updated portion of the speech model is erased from any block whose updated portion is determined not to be intact. According to the adaptive speech recognition device of claim 17, because the blocks are written, checked, transferred, and erased in this way, even if the power is cut unexpectedly and the speech model reflecting the latest update held in the volatile storage means is lost, the previous update state, or the one before that, can be transferred from the flash memory to the volatile storage means. Relearning of the speech model therefore becomes almost unnecessary.
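The block-selection procedure can be sketched as below. The source does not name the error-check method, so using CRC-32 via `zlib.crc32` is an assumption, as are the `FlashBlock` class and the function names; flash memory is simulated with in-memory objects.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlashBlock:
    """Simulated flash block holding one saved delta plus its bookkeeping."""
    payload: bytes = b""
    crc: int = 0
    write_count: int = 0
    erased: bool = True


def erase(block: FlashBlock) -> None:
    block.payload, block.crc, block.write_count, block.erased = b"", 0, 0, True


def write_delta(blocks: List[FlashBlock], payload: bytes) -> None:
    """Store the delta in an erased block, tagged with a CRC and a write counter."""
    target = next((b for b in blocks if b.erased), None)
    if target is None:
        raise RuntimeError("no erased block available")
    latest = max((b.write_count for b in blocks if not b.erased), default=0)
    target.payload = payload
    target.crc = zlib.crc32(payload)
    target.write_count = latest + 1
    target.erased = False


def load_latest_valid(blocks: List[FlashBlock]) -> Optional[bytes]:
    """Scan from the highest write count down; return the newest intact delta.

    Blocks that fail the check are erased. Returns None if nothing is intact.
    """
    written = sorted(
        (b for b in blocks if not b.erased),
        key=lambda b: b.write_count,
        reverse=True,
    )
    result: Optional[bytes] = None
    for block in written:
        if zlib.crc32(block.payload) == block.crc:
            if result is None:
                result = block.payload
        else:
            erase(block)
    return result
```

If the newest block was only partly written when power failed, its check fails and the loader falls back to the previous save, which matches the recovery behavior described in the paragraph above.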

[0010] The invention according to claim 18 is configured as a speech processing device comprising: the adaptive speech recognition device of any one of claims 1 to 17; and response control means for selecting, based on the recognition target word selected and determined by the recognition target word selection and determination means, response content stored in advance in association with that recognition target word, and for controlling the response to the input speech. The invention according to claim 19 is the speech processing device of claim 18, wherein the response content is stored divided into a plurality of response levels, the device further comprising response level determination means for determining the response level based on the similarity calculated by the similarity calculation means. The invention according to claim 20 is the speech processing device of claim 19, wherein the response level determination means sets a similarity threshold for each response level and determines the response level by comparing the similarity with the plurality of thresholds. The invention according to claim 21 is the speech processing device of claim 19, wherein the response level determination means performs an arithmetic operation on the similarity using a different coefficient for each response level, and determines the response level by comparing the similarity after the operation with a single predetermined threshold. According to the speech processing device of any one of claims 18 to 21, the response to the user corresponds to a recognition target word recognized on the basis of a speech model that has been suitably updated as described above, so an appropriate response can be selected.

The invention according to claim 22 is the speech processing device of claim 20, wherein some or all of the plurality of thresholds are changed each time there is input speech. The invention according to claim 23 is the speech processing device of claim 21, wherein some or all of the coefficients are changed each time there is input speech. According to the speech processing device of claim 22 or 23, the response changes each time there is input speech, so variety in the responses can be ensured.

The invention according to claim 24 is configured as a pet toy comprising the speech processing device of any one of claims 18 to 23, speech synthesis means, and drive means for driving a movable part, wherein the response control means controls the speech synthesis means and the drive means to respond to the input speech. According to the pet toy of claim 24, adjusting the degree of speech model update with the update parameter imitates the way a pet gradually becomes familiar with its owner, which makes it easier for the user to become attached to the pet toy.

【００１１】[0011]

【発明の実施の形態】以下，添付図面を参照して，本発
明の実施の形態につき説明し，本発明の理解に供する。
尚，以下の実施の形態は，本発明の具体的な一例であっ
て，本発明の技術的範囲を限定する性格のものではな
い。ここに，図１は本発明の一実施の形態に係る適応型
音声認識装置の概略構成を示す機能ブロック図であり，
図２は上記適応型音声認識装置に必要なハードウェア構
成の一例を示す図である。まず，本発明の一実施の形態
に係る適応型音声認識装置Ａ１は，図１に示す如く，入
力音声の特徴量を抽出する特徴抽出部（特徴量抽出手段
に相当）１と，認識対象語に対応する音声モデルを記憶
する音声モデル記憶部（音声モデル記憶手段に相当）２
と，上記特徴抽出部１により抽出された上記特徴量と，
上記音声モデル記憶部２に記憶された上記音声モデルと
の類似度を計算する評価値計算部（類似度計算手段に相
当）３と，上記評価値計算部３により計算された上記類
似度に基づいて，上記入力音声に対応する認識対象語の
選択判定を行う結果判定部（認識対象語選択判定手段に
相当）４と，上記特徴抽出部１により抽出された上記特
徴量に基づいて，上記音声モデル記憶部２に記憶された
上記音声モデルを更新する音声モデル更新部（音声モデ
ル更新手段に相当）５と，上記音声モデル更新部５によ
る上記音声モデルの更新度合いを制御する更新パラメー
タを決定する更新パラメータ決定部（更新パラメータ決
定手段に相当）６とを具備して構成されている。上記適
応型音声認識装置を具体化するには，例えば図２に示す
ようなマイク１０１，Ａ／Ｄ変換器１０２，プロセッサ
１０３，ＲＯＭ１０４，フラッシュメモリ１０５，ワー
クメモリ１０６を含むハードウェアが必要である。上記
各構成要素のうち，上記音声モデル記憶部２の具体例
が，上記ＲＯＭ１０４及びフラッシュメモリ１０５であ
り，その他の上記特徴抽出部１，類似度計算部３，認識
語判定部４，音声モデル更新部５，更新パラメータ決定
部６は，当該各構成要素に対応した処理が記述されたプ
ログラムを上記ワークメモリ１０６を用いながら上記プ
ロセッサ１０３に実行させることにより実現することが
できる。Embodiments of the present invention will be described below with reference to the accompanying drawings to aid understanding of the invention. The following embodiment is a specific example of the present invention and does not limit its technical scope. FIG. 1 is a functional block diagram showing the schematic configuration of an adaptive speech recognition device according to one embodiment of the present invention, and FIG. 2 is a diagram showing an example of the hardware configuration required for the adaptive speech recognition device. As shown in FIG. 1, the adaptive speech recognition device A1 according to this embodiment comprises: a feature extraction unit 1 (corresponding to the feature extraction means) that extracts feature quantities of the input speech; a speech model storage unit 2 (corresponding to the speech model storage means) that stores speech models corresponding to recognition target words; an evaluation value calculation unit 3 (corresponding to the similarity calculation means) that calculates the similarity between the feature quantities extracted by the feature extraction unit 1 and the speech models stored in the speech model storage unit 2; a result determination unit 4 (corresponding to the recognition target word selection determining means) that selects and determines the recognition target word corresponding to the input speech on the basis of the similarity calculated by the evaluation value calculation unit 3; a speech model updating unit 5 (corresponding to the speech model updating means) that updates the speech models stored in the speech model storage unit 2 on the basis of the feature quantities extracted by the feature extraction unit 1; and an update parameter determination unit 6 (corresponding to the update parameter determining means) that determines an update parameter controlling the degree to which the speech model updating unit 5 updates the speech model. Implementing the adaptive speech recognition device requires hardware such as that shown in FIG. 2, including a microphone 101, an A/D converter 102, a processor 103, a ROM 104, a flash memory 105, and a work memory 106. Of the above components, the speech model storage unit 2 is embodied by the ROM 104 and the flash memory 105. The others, namely the feature extraction unit 1, the evaluation value calculation unit 3, the result determination unit 4, the speech model updating unit 5, and the update parameter determination unit 6, can be realized by having the processor 103 execute, using the work memory 106, a program describing the processing corresponding to each component.

【００１２】次に，上記適応型音声認識装置Ａ１の基本
的な動作について説明する。上記適応型音声認識装置Ａ
１において，マイク１０１から入力された入力音声は，
特徴抽出部１へ供給される。（これを図２上で表現すれ
ば，マイク１０１から入力された入力音声がＡ／Ｄ変換
器１０２などを経てワークメモリ１０６に供給され，ソ
フトウェアの実行により上記特徴抽出部１に対応する機
能を実現した上記プロセッサ１０３により参照可能にさ
れるとなるが，プログラムを実行する場合の，プロセッ
サ１０３やワークメモリ１０６等の関係は周知のものと
同様であるので，以下では特に必要のない限り図１に準
じた機能上の表現を用いる）。上記特徴抽出部１では，
上記入力音声がフレームと呼ばれる所定単位時間毎に分
割され，例えば周知のＬＰＣケプストラムや，ＬＰＣメ
ルケプストラムといった特徴量が抽出される。尚，上記
特徴量の種類は，上記した２つに限定されるものではな
く，その次数等も限られるものではない。また，上記音
声モデル記憶部２には，認識対象語に対応するＨＭＭモ
デルや，ＤＰマッチングにおける音声特徴量ベクトル列
などの音声モデルが予め記憶されており，上記特徴抽出
部１により上記特徴量が抽出されると，上記評価値計算
部３により，上記特徴量と上記音声モデルとの類似度が
計算される。例えば上記音声モデル記憶部１に，音素や
音節，単語ごとに構築されたＨＭＭモデルが記憶されて
いる場合には，上記類似度は上記入力音声に対する各Ｈ
ＭＭ単語の尤度で表される。ＨＭＭモデルによる場合，
前記尤度が大きいほど類似度が大きいことになる。この
際，上記ＨＭＭモデルが，音素や音節ごとに構築されて
いれば，各ＨＭＭモデルが連結されて上記ＨＭＭ単語が
作成される。上記評価値計算部３により計算された各Ｈ
ＭＭ単語ごとの類似度は，結果判定部４に供給される。
上記結果判定部４では，各ＨＭＭ単語ごとに計算された
上記類似度が参照され，例えば上記ＨＭＭ単語の中で上
記類似度が最も大きいものが，認識単語として選択判定
される。また，上記特徴量がＤＰマッチングにおける音
声特徴量ベクトル列である場合には，特徴量の距離が小
さいほど類似度が大きいことになるから，上記結果判定
部４において，上記距離が最も小さいものが認識単語と
して選択判定される。Next, the basic operation of the adaptive speech recognition device A1 will be described. In the adaptive speech recognition device A1, speech input from the microphone 101 is supplied to the feature extraction unit 1. (In terms of FIG. 2, the speech input from the microphone 101 is supplied to the work memory 106 via the A/D converter 102 and so on, where it can be referenced by the processor 103, which realizes the function of the feature extraction unit 1 by executing software. Since the relationship between the processor 103, the work memory 106, and so on during program execution is the same as in well-known systems, the functional representation of FIG. 1 is used below unless otherwise necessary.) In the feature extraction unit 1, the input speech is divided into predetermined unit-time segments called frames, and feature quantities such as the well-known LPC cepstrum or LPC mel-cepstrum are extracted. The type of feature quantity is not limited to these two, nor is its order. The speech model storage unit 2 stores in advance speech models corresponding to the recognition target words, such as HMMs or speech feature vector sequences for DP matching. When the feature extraction unit 1 has extracted the feature quantities, the evaluation value calculation unit 3 calculates the similarity between the feature quantities and the speech models. For example, when the speech model storage unit 2 stores HMMs constructed for each phoneme, syllable, or word, the similarity is expressed as the likelihood of each word HMM for the input speech. With HMMs, the greater the likelihood, the greater the similarity. If the HMMs are constructed per phoneme or per syllable, they are concatenated to form the word HMMs. The similarity calculated by the evaluation value calculation unit 3 for each word HMM is supplied to the result determination unit 4. The result determination unit 4 refers to the similarity calculated for each word HMM and selects, for example, the word HMM with the highest similarity as the recognized word. When speech feature vector sequences for DP matching are used instead, a smaller distance between feature quantities means a greater similarity, so the result determination unit 4 selects the candidate with the smallest distance as the recognized word.
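
The selection rule of the result determination unit 4 can be sketched as follows. This is a minimal illustration, assuming the per-word scores have already been computed by the evaluation value calculation unit 3; the function names, the word labels, and the numeric scores are hypothetical and do not come from the patent.

```python
def pick_by_likelihood(likelihoods):
    """HMM case: a larger likelihood means a greater similarity.

    likelihoods maps each candidate word to its (log-)likelihood.
    """
    return max(likelihoods, key=likelihoods.get)


def pick_by_distance(distances):
    """DP-matching case: a smaller distance means a greater similarity."""
    return min(distances, key=distances.get)


# Hypothetical scores for three recognition target words.
hmm_scores = {"sit": -120.5, "paw": -98.2, "come": -143.0}
dp_distances = {"sit": 14.2, "paw": 31.7, "come": 22.9}

recognized_hmm = pick_by_likelihood(hmm_scores)
recognized_dp = pick_by_distance(dp_distances)
```

Both functions return only the winning word; a fuller version would also return the winning score, since later stages use the similarity as well.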

【００１３】このようにして入力音声に対する音声認識
が行われるが，不特定の話者を対象としている場合に
は，全ての話者に対して高い認識率を確保することは困
難であるので，学習によって，使用者に適応させた方が
好ましい場合がある。この適応は，上記音声モデル記憶
部２に記憶された上記音声モデルを，上記入力音声の特
徴量に応じて更新することを意味する。上記適応型音声
認識装置Ａ１における音声モデル更新部５は，このため
のものである。音声認識手法（ＨＭＭか，ＤＰマッチン
グか，ＨＭＭでもそのデータの持ち方）によって異なる
が，上記音声モデルが例えば連続混合分布ＨＭＭの場合
には，音声モデル中の混合正規分布の特徴量平均ベクト
ルを入力音声のそれに近づけることにより，上記音声モ
デルを話者に適応させる上記更新が行われる。より具体
的には，上記類似度が計算された時（若しくは上記認識
語の判定が行われた後に再度上記類似度を計算する際）
に，図３に示す如く，最適尤度を算出する状態遷移経路
の情報が記憶され，各状態の最終フレームから最初のフ
レームへ順番にバックトラック処理によって，対応する
状態が求められる。尚，最初のフレームから最終フレー
ムまでを状態数で均等に割り振るという手法を用いても
よい。Speech recognition of the input speech is performed in this way. When unspecified speakers are targeted, however, it is difficult to ensure a high recognition rate for all speakers, so it may be preferable to adapt the device to its user through learning. This adaptation means updating the speech model stored in the speech model storage unit 2 according to the feature quantities of the input speech. The speech model updating unit 5 in the adaptive speech recognition device A1 serves this purpose. The procedure depends on the recognition method (HMM or DP matching, and, for HMMs, how the data are held), but when the speech model is, for example, a continuous mixture density HMM, the update that adapts the speech model to the speaker is performed by moving the mean feature vectors of the Gaussian mixture components in the speech model toward those of the input speech. More specifically, when the similarity is calculated (or when it is recalculated after the recognized word has been determined), information on the state transition path that yields the optimum likelihood is stored, as shown in FIG. 3, and the state corresponding to each frame is obtained by backtracking from the last frame to the first. Alternatively, the frames from the first to the last may simply be allocated evenly among the states.
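
The simpler alternative mentioned above, allocating the frames evenly among the states, can be sketched as follows. This is only an illustration under the assumption that states are numbered from 0 and that there are at least as many frames as states; the function name is hypothetical, and the backtracking variant is not shown.

```python
def assign_frames_evenly(num_frames, num_states):
    """Return, for each frame index, the index of the state it is assigned to.

    Frames are split into num_states contiguous runs of nearly equal length.
    """
    if num_states <= 0 or num_frames < num_states:
        raise ValueError("need at least one frame per state")
    return [t * num_states // num_frames for t in range(num_frames)]


# Ten frames spread over a three-state word model.
alignment = assign_frames_evenly(10, 3)
```

The resulting list plays the same role as the state sequence recovered by backtracking: it tells the update step which frames contribute to which state.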

【００１４】このようにして各フレームの音声特徴量ベ
クトルが状態に割り付けられると，次に図４に示す如
く，音声モデルＨＭＭの各状態ごとに割りつけられた音
声特徴量ベクトルの平均ベクトルνｉ（ｎ）（ｉは状態
番号，ｎはベクトル次数）が算出される。これが，スペ
クトル内挿によって適応化される。即ち，適応化前のＨ
ＭＭの平均ベクトルをμｉｊ（ｎ）（ｉは状態番号，ｊ
は混合番号，ｎはベクトル次数）とすると，移動ベクト
ルΔｉｊ（ｎ）は，次式（１）に従って計算される。 Δｉｊ（ｎ）＝νｉ（ｎ）−μｉｊ（ｎ）（１）上記移動ベクトルΔｉｊ（ｎ）を用いると，ＨＭＭの更
新後の平均ベクトルは，次式で表現される。 μ’ｉｊ（ｎ）＝μｉｊ（ｎ）＋ｋΔｉｊ（ｎ）（２）但し，ｋは適応化における更新パラメータであり，ｋの
範囲は〔０，１〕である。この更新パラメータｋは，上
記音声モデル記憶部２に記憶されている音声モデルの入
力音声に対する更新度合いを制御するためのものであ
る。上記更新パラメータｋが設定されず，全ての入力音
声に対して平等に更新が行われると，無意味な単語の発
声や誤った発声などによって上記音声モデルが改悪され
てしまう恐れがある。そこで，本発明に係る音声認識装
置Ａ１では，学習に好ましいと思われる所定の条件に従
って上記更新パラメータｋを決定することにより，好ま
しくない学習を避けるように，上記更新パラメータｋを
当該入力音声について決定する更新パラメータ決定部６
が設けられている。上記更新パラメータ決定部６では，
例えば上記類似度に基づいて上記更新パラメータｋが決
定される。上記類似度が高い場合には上記更新パラメー
タｋが大きな値にされ，上記類似度が低い場合には上記
更新パラメータｋが低い値にされる。Once the speech feature vector of each frame has been assigned to a state in this way, the mean vector $\nu_i(n)$ of the speech feature vectors assigned to each state of the speech model HMM is calculated, as shown in FIG. 4 (i is the state number and n is the vector order). The model is then adapted toward this mean by spectral interpolation. That is, letting $\mu_{ij}(n)$ be the mean vector of the HMM before adaptation (i is the state number, j is the mixture number, and n is the vector order), the shift vector $\Delta_{ij}(n)$ is calculated according to equation (1):
$\Delta_{ij}(n) = \nu_i(n) - \mu_{ij}(n)$ (1)
Using the shift vector $\Delta_{ij}(n)$, the updated mean vector of the HMM is given by equation (2):
$\mu'_{ij}(n) = \mu_{ij}(n) + k\,\Delta_{ij}(n)$ (2)
where k is the update parameter for adaptation, with range [0, 1]. The update parameter k controls the degree to which the speech model stored in the speech model storage unit 2 is updated with respect to the input speech. If no update parameter k were set and the model were updated equally for every input utterance, the speech model could be degraded by utterances of meaningless words or by erroneous utterances. The speech recognition device A1 according to the present invention is therefore provided with an update parameter determination unit 6, which determines the update parameter k for each input utterance according to predetermined conditions considered favorable for learning, so that undesirable learning is avoided. The update parameter determination unit 6 determines k on the basis of, for example, the similarity: when the similarity is high, k is set to a large value, and when the similarity is low, k is set to a small value.
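
Equations (1) and (2) can be sketched in a few lines. This is a minimal illustration, assuming plain Python lists for the vectors; the function names and the sample numbers are hypothetical.

```python
def state_mean(frames):
    """nu_i: component-wise mean of the feature vectors assigned to one state."""
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]


def update_means(mixture_means, nu, k):
    """Apply equations (1) and (2) to every mixture mean mu_ij of one state.

    mixture_means: list of mean vectors mu_ij (one per mixture j)
    nu:            mean vector nu_i of the input frames for this state
    k:             update parameter in [0, 1]
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    updated = []
    for mu in mixture_means:
        delta = [v - m for v, m in zip(nu, mu)]                  # equation (1)
        updated.append([m + k * d for m, d in zip(mu, delta)])   # equation (2)
    return updated


nu_i = state_mean([[1.0, 2.0], [3.0, 4.0]])
new_means = update_means([[0.0, 0.0], [4.0, 6.0]], nu_i, k=0.5)
```

With k = 0 the model is left unchanged, and with k = 1 every mixture mean of the state is replaced by the mean of the input frames; values in between interpolate linearly.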

【００１５】図５の例では，上記類似度が４段階に区分
けされており，この区分けのために３つのしきい値１
ｔ，２ｔ，３ｔが設定されている。上記区分けごとに上
記更新パラメータｋの値が設定されている。上記類似度
が，しきい値１ｔよりも小さい場合，決定される上記更
新パラメータｋの値は０であり，しきい値ｔ１からしき
い値ｔ２の間では，ｋ＝１であり，しきい値２ｔからし
きい値３ｔの間では，ｋ＝２であり，しきい値３ｔ以上
では，ｋ＝０．４である。即ち，類似度が１ｔよりも小
さいｃａｓｅ１の場合，無意味な単語の発声や誤った認
識による結果と判別され，ｋ＝０となり，適応学習は行
われない。上記類似度が大きくなれば，その大きさに従
ってｋ＝０．１，０．２，０．４と，より大きな値が与
えられる。また，上記更新パラメータｋの設定は，上記
類似度の大きさに限らず，例えば使用者から学習である
と意図的に指示された場合に，大きな値を与えるように
してもよい。通常の認識動作を行う通常モードと，発声
内容を使用者に事前に通知する学習モードとのいずれか
一方のモードを選択するためのスイッチ７（モード選択
手段に対応）がある場合には，上記学習モード側にスイ
ッチ７の切り替えが行われると，上記更新パラメータ決
定部６でその切り替えが検出され，例えば更新パラメー
タｋが０．８という高い値に設定される。この場合の更
新パラメータｋの値は，上記類似度の大きさに必ずしも
従っている必要はなく一定値でも構わない。これは，上
記した通り，学習モードでは，基本的に装置使用者に予
め発声する内容が通知され，使用者はその通知内容に従
って発声を行うという前提があるためである。In the example of FIG. 5, the similarity is divided into four ranges by three thresholds t1, t2, and t3, and a value of the update parameter k is set for each range. When the similarity is below threshold t1, k is 0; between t1 and t2, k = 0.1; between t2 and t3, k = 0.2; and at or above t3, k = 0.4. That is, in case 1, where the similarity is below t1, the input is judged to be the result of a meaningless utterance or a misrecognition, k = 0, and no adaptive learning is performed. As the similarity increases, progressively larger values k = 0.1, 0.2, and 0.4 are assigned. The update parameter k need not be set from the similarity alone; for example, a large value may be given when the user has deliberately indicated that learning is intended. If a switch 7 (corresponding to the mode selection means) is provided for selecting either a normal mode, in which ordinary recognition is performed, or a learning mode, in which the user is told in advance what to utter, then when switch 7 is set to the learning mode, the update parameter determination unit 6 detects the change and sets k to a high value, for example 0.8. In this case k need not follow the similarity and may be a constant. This is because, as noted above, the learning mode presupposes that the user is told in advance what to utter and utters it accordingly.
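
The mapping from similarity to update parameter in FIG. 5, together with the constant used in the learning mode, can be sketched as follows. The k values 0, 0.1, 0.2, 0.4, and 0.8 come from the description; the threshold values, the function name, and the treatment of a similarity exactly equal to a threshold (it is placed in the higher range) are assumptions made for illustration.

```python
from bisect import bisect_right

K_NORMAL = (0.0, 0.1, 0.2, 0.4)   # one value per similarity range
K_LEARNING = 0.8                  # constant used when the learning mode is selected


def update_parameter(similarity, thresholds, learning_mode=False):
    """Return k for one utterance.

    thresholds: the three range boundaries in ascending order.
    """
    if learning_mode:
        return K_LEARNING
    # bisect_right counts how many thresholds are <= similarity,
    # which is exactly the index of the range the similarity falls in.
    return K_NORMAL[bisect_right(thresholds, similarity)]


hypothetical_thresholds = (10.0, 20.0, 30.0)
k_low = update_parameter(5.0, hypothetical_thresholds)
k_high = update_parameter(35.0, hypothetical_thresholds)
```

Because the lowest range maps to k = 0, utterances that match no model well never move the stored means.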

【００１６】もちろん，図６に示す如く，上記学習モー
ドにおいても上記類似度に応じて上記更新パラメータｋ
に与える値を変化させても構わない。図６の例では，通
常モードでは，図５の場合と同様上記更新パラメータｋ
に０，０．１，０．２，０．４という値が与えられてい
るのに対し，学習モードではそれらの値よりも大きい，
０．１，０．２，０．４，０．８という値が与えられ
る。このように学習モードにおいては，大きな値を更新
パラメータｋに与えることにより，適応度合いを強め，
学習効率を向上させることができる。さらに，音声区間
長さによる重み付けを行うことで，より確実な話者適応
が可能である。一般的に，単語の発声においては，発声
時間はその文字数に比例する傾向にある。このため，発
声時間を考慮すれば，ある程度，意味のある発声である
か否かを評価することができる。そこで，上記（２）式
に発声時間に応じた重み付けｌを，次式（３）のように
導入する。 μ’ｉｊ（ｎ）＝μｉｊ（ｎ）＋ｌ×ｋ×Δｉｊ（ｎ）（３）上記重み付けｌは，不特定話者の大量の音声からＨＭＭ
を予め学習する際に，統計的にその単語ＨＭＭの音声長
さを計測しておき，音声モデルの構築とともに格納して
おけば，その分布に従って定めることができる。即ち，
多くの入力音声が，所定の発声時間に偏っていれば，そ
の発声時間に対して大きな値の重みが割りつけられる。
例えば図７では，音声区間長が２．０秒から３．０秒ま
での範囲にある場合に最も大きい１．０という値が与え
られており，その範囲の両側では値は低下させられお
り，音声区間長が１．０秒よりも小さい場合や，４．０
秒よりも大きい場合には０という値が重みに与えられて
いる。これは，音声区間長が１．０秒よりも小さい場合
や，４．０秒よりも大きい場合には，無意味な発声であ
ったりする可能性が高いためである。このようにして上
記更新パラメータｋや重みｌの値を決定することによ
り，本実施の形態に係る適応型音声認識装置では，音声
モデルを特定の話者に適応させる際に，無意味な発声や
誤った認識によって音声モデルに与える悪影響が抑制さ
れる。Of course, as shown in FIG. 6, the value given to the update parameter k may also be varied according to the similarity in the learning mode. In the example of FIG. 6, the normal mode assigns k the values 0, 0.1, 0.2, and 0.4, as in FIG. 5, whereas the learning mode assigns the larger values 0.1, 0.2, 0.4, and 0.8. Giving k larger values in the learning mode strengthens the adaptation and improves learning efficiency. Weighting by the length of the speech segment makes speaker adaptation more reliable still. In general, the utterance time of a word tends to be proportional to its number of characters, so taking the utterance time into account makes it possible to judge, to some extent, whether an utterance is meaningful. A weight l that depends on the utterance time is therefore introduced into equation (2), giving equation (3):
$\mu'_{ij}(n) = \mu_{ij}(n) + l \times k \times \Delta_{ij}(n)$ (3)
The weight l can be set from the distribution of speech lengths for each word HMM, which is measured statistically when the HMMs are trained in advance on a large amount of speech from unspecified speakers and is stored together with the speech model. That is, if many utterances cluster around a particular utterance time, a large weight is assigned to that utterance time. In FIG. 7, for example, the largest value, 1.0, is assigned when the speech segment length is between 2.0 and 3.0 seconds, the value decreases on either side of that range, and a weight of 0 is assigned when the length is shorter than 1.0 second or longer than 4.0 seconds. This is because utterances shorter than 1.0 second or longer than 4.0 seconds are likely to be meaningless. By determining the update parameter k and the weight l in this way, the adaptive speech recognition device according to this embodiment suppresses the adverse effects of meaningless utterances and misrecognition on the speech model when adapting the model to a specific speaker.
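
The duration weighting of FIG. 7 and equation (3) can be sketched as follows. The breakpoints 1.0, 2.0, 3.0, and 4.0 seconds and the peak value 1.0 come from the description; the linear ramps between them are an assumption, since the text only says that the value decreases on either side of the peak range. Function names are hypothetical.

```python
def duration_weight(seconds, lo=1.0, peak_lo=2.0, peak_hi=3.0, hi=4.0):
    """Weight l for one utterance, based on its speech segment length."""
    if seconds < lo or seconds > hi:
        return 0.0                                   # likely a meaningless utterance
    if seconds < peak_lo:
        return (seconds - lo) / (peak_lo - lo)       # assumed linear rise
    if seconds > peak_hi:
        return (hi - seconds) / (hi - peak_hi)       # assumed linear fall
    return 1.0


def weighted_update(mu, nu, k, l):
    """Equation (3): mu' = mu + l * k * (nu - mu), component by component."""
    return [m + l * k * (v - m) for m, v in zip(mu, nu)]


l_weight = duration_weight(1.5)
mu_new = weighted_update([0.0, 2.0], [4.0, 2.0], k=0.5, l=l_weight)
```

In practice the breakpoints would be taken per word from the stored length distribution rather than fixed globally as they are here.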

【００１７】次に，上記適応型音声認識装置Ａ１を応用
した音声処理装置Ａ２，及びペット玩具Ａ３について説
明する。上記音声処理装置Ａ２は，上記適応型音声認識
装置Ａ１の構成に加えて，それによって認識された認識
対象語に対応する発声を行う構成を有し，例えば犬や猫
などのペットを模擬したペット玩具Ａ３に組み込まれた
ものとして具体化される。ここで，図８に上記音声処理
装置Ａ２及びそれを組み込んだペット玩具Ａ３の概略構
成を示す。尚，図８では，上記更新パラメータ決定部６
は上記音声モデル更新部５に含まれて表されている。図
８に示す如く，上記音声処理装置Ａ２は，上記適応型音
声認識装置Ａ１に相当する構成に加えて，上記結果判定
部４により判定された認識対象語に基づいて，予め認識
対象語に対応して記憶された応答内容を選択し，上記入
力音声に対する応答を制御する応答制御部（応答制御手
段）８を具備しており，上記応答制御部８は，応答レベ
ル決定部（応答レベル決定手段に相当）８１と，応答決
定部８２とを備えている。上記応答レベル決定部８１
は，選択された応答内容が複数の応答レベルに分割され
ている場合に，上記類似度に基づいて上記応答レベルを
選択するためのものであり，上記応答決定部８２は一つ
の応答レベルに複数の応答が含まれている場合に，その
うちのいずれか選択するためのものである。また，上記
音声処理装置Ａ２を組み込んだペット玩具Ａ３には，犬
の尻尾などに対応する部位２００を動作させるための可
動部２０１，及びモータ２０２と，モータを駆動する駆
動部２０３と，犬や猫などの鳴き声などを模擬した音声
を合成するための音声合成部２０４と，音声合成部２０
４により合成された合成音声を音響信号として出力する
スピーカ２０５とが備えられており，上記駆動部２０３
や音声合成部２０４は上記応答決定部８２により決定さ
れた応答に従って制御される。Next, a speech processing device A2 and a pet toy A3 that apply the adaptive speech recognition device A1 will be described. In addition to the configuration of the adaptive speech recognition device A1, the speech processing device A2 has a configuration for producing an utterance corresponding to the recognition target word that was recognized, and is embodied, for example, as a component built into a pet toy A3 that imitates a pet such as a dog or a cat. FIG. 8 shows the schematic configuration of the speech processing device A2 and of the pet toy A3 incorporating it. In FIG. 8, the update parameter determination unit 6 is shown as part of the speech model updating unit 5. As shown in FIG. 8, in addition to the configuration corresponding to the adaptive speech recognition device A1, the speech processing device A2 comprises a response control unit 8 (response control means) that, on the basis of the recognition target word determined by the result determination unit 4, selects a response content stored in advance for that word and controls the response to the input speech. The response control unit 8 comprises a response level determination unit 81 (corresponding to the response level determining means) and a response determination unit 82. The response level determination unit 81 selects a response level on the basis of the similarity when the selected response content is divided into a plurality of response levels, and the response determination unit 82 selects one response when a single response level contains several. The pet toy A3 incorporating the speech processing device A2 is provided with a movable part 201 and a motor 202 for moving a part 200 corresponding to, for example, a dog's tail; a drive unit 203 that drives the motor; a speech synthesis unit 204 that synthesizes sounds imitating the cries of a dog, a cat, or the like; and a speaker 205 that outputs the speech synthesized by the speech synthesis unit 204 as an acoustic signal. The drive unit 203 and the speech synthesis unit 204 are controlled according to the response determined by the response determination unit 82.

【００１８】上記ペット玩具Ａ３は，使用者からの入力
音声に対して，例えば上記尻尾などに対応する部位２０
０を動作させたり，「ワンワン」などの鳴き声を発声し
て，使用者に応答する。このような応答を行う際の上記
音声処理装置Ａ２，及びこれを組み込んだペット玩具Ａ
３の動作は以下の通りである。上記音声処理装置Ａ２，
及びペット玩具Ａ３において，上記適応型音声認識装置
Ａ１の結果判定部４から，入力音声に対して選択判定さ
れた認識対象語が判定されると，当該認識対象語と，当
該認識語の判定の際に用いた類似度が，応答レベル決定
部８１に供給される。認識の対象となっている認識対象
語には，予め対応する応答内容が記憶されている。上記
応答内容には，各種の鳴き声の発声や，特定部位の選択
又は同時可動などが含まれるが，各応答内容にはさらに
上記類似度に応じた応答レベルが設定されている。例え
ばこの応答レベルは，図９に示すようなものである。図
９の例では，３つのしきい値１ｔ’，２ｔ’，３ｔ’に
より４つの応答レベル１ｌ，２ｌ，３ｌ，４ｌが規定さ
れており，類似度が最も低いレベル１ｌ，即ちしきい値
１ｔ’より上記類似度が小さい場合にはマイク１０１か
ら音声が入力されても何の反応も行われず，上記応答レ
ベル２ｌ，即ちしきい値１ｔ’としきい値２ｔ’との間
に上記類似度がある場合には，「ワンワン」という発声
が設定されており，さらに類似度が大きくなるに連れ
て，「こんにちわ」，「こんにちわ，元気ですか」とい
う発声が設定されている。上記応答レベルは，基本的に
は階層が高くなるに連れて，応答がより積極的なものと
なる。尚，上記応答内容，及び応答レベルは，各認識対
象語ごとに設定してもよいし，ある特定の認識対象語に
対して設定してもよいし，全ての認識対象語に共通して
設定してもよい。The pet toy A3 responds to speech input from the user by, for example, moving the part 200 corresponding to the tail or uttering a cry such as "wan-wan" (woof-woof). The speech processing device A2 and the pet toy A3 incorporating it operate as follows when making such a response. When the result determination unit 4 of the adaptive speech recognition device A1 has selected and determined the recognition target word for the input speech, that word and the similarity used in determining it are supplied to the response level determination unit 81. Response contents are stored in advance for the recognition target words. The response contents include uttering various cries and moving a selected part or several parts simultaneously, and each response content is further assigned response levels according to the similarity. An example of these response levels is shown in FIG. 9. In the example of FIG. 9, three thresholds t1', t2', and t3' define four response levels l1, l2, l3, and l4. At the lowest level l1, that is, when the similarity is below threshold t1', no reaction is made even if speech is input from the microphone 101. At response level l2, that is, when the similarity lies between t1' and t2', the utterance "wan-wan" is set, and as the similarity increases further the utterances "Hello" and "Hello, how are you?" are set. In principle, the higher the level, the more active the response. The response contents and response levels may be set for each recognition target word, for one specific recognition target word, or in common for all recognition target words.
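
The threshold-based response levels of FIG. 9 can be sketched as follows. The four responses come from the description; the threshold values, the names, and the handling of a similarity exactly equal to a threshold are assumptions for illustration.

```python
from bisect import bisect_right

# Responses for the four levels; None means the toy does not react.
LEVEL_RESPONSES = (None, "wan-wan", "Hello", "Hello, how are you?")


def response_level(similarity, thresholds):
    """Return the response level as 1..4 for three ascending thresholds."""
    return bisect_right(thresholds, similarity) + 1


def response_for(similarity, thresholds):
    """Look up the utterance stored for the level the similarity falls in."""
    return LEVEL_RESPONSES[response_level(similarity, thresholds) - 1]


hypothetical_thresholds = (10.0, 20.0, 30.0)
reply = response_for(25.0, hypothetical_thresholds)
```

A per-word table of responses would replace the single `LEVEL_RESPONSES` tuple when response contents are set separately for each recognition target word.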

【００１９】また，上記類似度に対するしきい値１
ｔ’，２ｔ’，３ｔ’も固定されている必要はない。上
記ペット玩具Ａ３の場合，そのときどきに応じて応答の
仕方を変化させた方が，使用者を飽きさせず好適であ
る。上記しきい値１ｔ’，２ｔ’，３ｔ’の設定の変更
は，例えばランダムな変数に基づいて上記応答レベルの
上下関係を変化させないように行われる。上記ランダム
変数の生成には，例えば内部時計や装置の電圧値などを
用いることができるが，特に限定されるものではない。
このランダム変数を元のしきい値に加算したり，減算し
たりして，上記しきい値の設定を変化させる。このと
き，上記応答レベルの上下関係を変化させないために
は，例えばしきい値２ｔ’の上下にある他のしきい値１
ｔ’，３ｔ’を最小値及び最大値として設定し，上記ラ
ンダム変数が上記最小値以下となったり，上記最大値以
上となった場合には，上記最小値から上記最大値までの
範囲で当該しきい値２ｔ’がおさまるまで，ランダム変
数の生成を繰り返せばよい。これにより応答レベルの数
が少ない認識対象語が，使用者により発声された場合で
も，上記ペット玩具Ａ３の応答に多様性を持たせること
が可能となる。さらに，上記音声モデルの適応が進むに
従って装置使用者が同一の単語を発声した場合でも，上
記類似度は大きくなっていくことになるから，これに合
わせて上記しきい値も全体的に上昇させることにより，
他人の発声に対する拒絶効果を増大させることができ
る。これにより，ペットが飼い主に馴染み，飼い主以外
の発声に非積極的になる現象を模倣して，使用者のペッ
ト玩具に対する愛着をも高めることができる。尚，上記
拒絶効果は，下位にあるしきい値を徐々に上昇させると
効果的に高められる。The thresholds t1', t2', and t3' for the similarity need not be fixed either. For the pet toy A3, varying the way it responds from one occasion to the next is preferable because it keeps the user from getting bored. The thresholds are changed, for example, on the basis of a random variable, in such a way that the ordering of the response levels does not change. The random variable can be generated from, for example, the internal clock or a voltage value in the device, but its source is not particularly limited. The random variable is added to or subtracted from the original threshold to change its setting. To keep the ordering of the response levels unchanged, the neighboring thresholds t1' and t3' on either side of, for example, threshold t2' are taken as its minimum and maximum, and if the resulting value falls at or below the minimum or at or above the maximum, generation of the random variable is repeated until threshold t2' falls within the range between the minimum and the maximum. This gives variety to the responses of the pet toy A3 even when the user utters a recognition target word that has few response levels. Furthermore, as adaptation of the speech model progresses, the similarity grows larger even when the user utters the same word, so raising the thresholds as a whole in step with this increases the rejection of other people's utterances. This imitates the way a pet becomes accustomed to its owner and responds less readily to anyone else, which can also strengthen the user's attachment to the pet toy. The rejection effect is enhanced effectively by gradually raising the lower thresholds.
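
The retry rule for randomizing a threshold without disturbing the ordering of the levels can be sketched as follows. Python's `random` module stands in for the clock-based or voltage-based random source mentioned in the text; the function name, the `spread` parameter, and the sample numbers are hypothetical.

```python
import random


def jitter_threshold(base, lower, upper, spread, rng=random):
    """Return base plus a random offset, redrawing until lower < result < upper.

    lower and upper are the neighbouring thresholds, so the ordering of the
    response levels is preserved.
    """
    if not lower < base < upper:
        raise ValueError("base must lie strictly between its neighbours")
    while True:
        candidate = base + rng.uniform(-spread, spread)
        if lower < candidate < upper:
            return candidate


rng = random.Random(0)  # fixed seed so the sketch is repeatable
t2_today = jitter_threshold(20.0, lower=10.0, upper=30.0, spread=15.0, rng=rng)
```

Because the unperturbed threshold always lies inside the allowed interval, a valid draw exists and the loop ends after a few attempts on average.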

【００２０】また，上記しきい値１ｔ’，２ｔ’，３
ｔ’を上昇させるための指標としては，例えば内部時計
から取得された使用時間の合計や，認識対象語毎の適応
回数などを用いることができる。また，上記図９の例で
は，しきい値を３つ用いて上記類字度を応答レベルと対
応付けていたが，これに限られるものではなく，図１０
に示す如く，しきい値は一つだけ設定しておき，各応答
レベルに対応する係数１ｓ，２ｓ，３ｓ，４ｓ等を基に
上記類似度について演算処理を行い，各係数１ｓ，２
ｓ，３ｓ，４ｓを基に演算処理した上記類似度が上記一
つのしきい値を上回るか否かによって応答レベルを選択
するようにしてもよい。図１０の例では，係数３ｓと４
ｓを用いて演算処理した場合に上記しきい値を上記類似
度が上回っているが，例えばそのうち値の最も低いも
の，即ちレベル３ｌを選択するようにすればよい。この
ように一つのしきい値と各応答レベルにそれぞれ対応し
た係数を用いる場合でも，上記しきい値を複数用いる場
合と同様，音声が入力される度にランダムに上記係数を
変化させたり，認識対象語の適応度合いに応じて係数を
上昇させ，応答の多様性を確保することが可能である。
上記のようにして，上記応答レベル決定部８１では，上
記認識対象語，及び上記認識対象語の選択判定の際に用
いた類似度に基づいて上記応答レベルが決定され，この
応答レベルは応答決定部８２に供給される。As an index for raising the thresholds t1', t2', and t3', the total usage time obtained from the internal clock, the number of adaptations for each recognition target word, or the like can be used. In the example of FIG. 9 the similarity was mapped to a response level using three thresholds, but this is not the only option. As shown in FIG. 10, a single threshold may be set, arithmetic processing applied to the similarity using coefficients s1, s2, s3, s4, and so on corresponding to the respective response levels, and the response level selected according to whether the similarity processed with each coefficient exceeds that single threshold. In the example of FIG. 10, the processed similarity exceeds the threshold for coefficients s3 and s4; in such a case the one with the lowest value among them, namely level l3, may for example be selected. Even when a single threshold and per-level coefficients are used in this way, the coefficients can be changed randomly each time speech is input, or raised according to the degree of adaptation of the recognition target word, to ensure a variety of responses, just as when multiple thresholds are used. In this way the response level determination unit 81 determines the response level on the basis of the recognition target word and the similarity used in selecting it, and supplies this response level to the response determination unit 82.
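
The single-threshold variant of FIG. 10 can be sketched as follows. The text does not say what the arithmetic processing is or how the coefficients are ordered, so this sketch assumes simple multiplication and picks, among the levels whose processed similarity exceeds the threshold, the one with the lowest processed value. The coefficients, the threshold, and the behaviour when no level passes (returning `None`) are all hypothetical.

```python
def select_level(similarity, coefficients, threshold):
    """Return the 1-based response level, or None if no level passes.

    coefficients: one coefficient per response level, in level order.
    """
    passing = []
    for level, coeff in enumerate(coefficients, start=1):
        processed = similarity * coeff          # assumed form of the processing
        if processed > threshold:
            passing.append((processed, level))
    if not passing:
        return None
    # Lowest processed value wins; ties go to the lower level.
    return min(passing)[1]


hypothetical_coefficients = (0.4, 0.7, 1.1, 1.5)
level = select_level(10.0, hypothetical_coefficients, threshold=10.0)
```

With these numbers only the third and fourth coefficients push the similarity over the threshold, and the third level is chosen, which mirrors the situation described for FIG. 10.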

【００２１】上記応答決定部８２では，使用者に対して
最終的に行われる応答が決定される，決定された応答に
従う制御信号が生成される。上記応答レベル決定部８１
から供給された応答レベルに対応する応答が一つの場合
には，そのまま上記駆動部２０３や音声合成部２０４に
上記制御信号を送出するが，上記応答レベルに対応する
応答が複数の場合もある。例えば上記「こんにちわ」と
いう発声内容が設定された応答レベル３ｌに対して，図
１１に示す如く，「びっくりした」，「なんですか？」
といった他の発声内容が設定されている場合である。ま
た，他の例として，上記尻尾に対応する部位２００を動
作させる応答内容が，その速度によって複数の応答レベ
ルに分けられている場合に，各応答レベルに対して，横
方向に動作させる，円を描くように動作させる，前後方
向に動作させるなど複数の応答が割り当てられている場
合がある。上記最終的な応答は，これらの複数の応答の
うちから例えばランダムに選択してもよいし，各応答に
対して応答回数を計数しておき，応答回数の少ないもの
を優先的に選択するようにしてもよい。そして，上記応
答決定部８２により応答が決定され，上記音声合成部２
０４や駆動部２０３に上記制御信号が送出されると，上
記制御信号に従って上記駆動部２０３や，上記音声合成
部２０４が動作させられ，実際の応答が行われる。この
ように上記音声処理装置を組み込んだペット玩具Ａ３で
は，応答の多様性を確保したり，飼い主へ馴染む現象を
模倣したりして，使用者との対話性を向上させることが
できる。The response determination unit 82 determines the response finally made to the user and generates a control signal according to the determined response. When only one response corresponds to the response level supplied from the response level determination unit 81, the control signal for it is sent directly to the drive unit 203 or the speech synthesis unit 204, but a response level may also have several responses. For example, as shown in FIG. 11, response level l3, for which the utterance "Hello" is set, may also have other utterances such as "You surprised me" and "What is it?". As another example, when the response content of moving the part 200 corresponding to the tail is divided into several response levels by speed, several responses may be assigned to each level, such as moving it sideways, moving it in a circle, or moving it back and forth. The final response may be selected from among these, for example, at random, or the number of times each response has been made may be counted and responses made fewer times selected preferentially. Once the response determination unit 82 has determined the response and sent the control signal to the speech synthesis unit 204 or the drive unit 203, these operate according to the control signal and the actual response is made. In this way, the pet toy A3 incorporating the speech processing device can improve interactivity with the user by ensuring a variety of responses and by imitating the way a pet becomes accustomed to its owner.
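
The two selection strategies for the response determination unit 82, random choice and preference for rarely used responses, can be combined in a short sketch. This is one possible reading, assuming that ties among the least-used responses are broken at random; the class name and the candidate strings are hypothetical.

```python
import random
from collections import Counter


class ResponsePicker:
    """Pick one response per call, favouring those used least often so far."""

    def __init__(self, rng=None):
        self.counts = Counter()
        self.rng = rng or random.Random()

    def pick(self, candidates):
        fewest = min(self.counts[c] for c in candidates)
        pool = [c for c in candidates if self.counts[c] == fewest]
        choice = self.rng.choice(pool)      # random among the least used
        self.counts[choice] += 1
        return choice


picker = ResponsePicker(random.Random(1))
level3 = ["Hello", "You surprised me", "What is it?"]
first_three = [picker.pick(level3) for _ in range(3)]
```

Each candidate is used once before any is repeated, so the toy cycles through all responses of a level in a varying order.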

【００２２】ところで，上記のようなペット玩具Ａ３等
に上記適応型音声認識装置Ａ１を応用した場合，音声が
入力されて，上記音声モデル更新部５により更新された
上記音声モデルが，揮発性のワークメモリ１０６にある
にもかかわらず，不意に電源がオフされてしまう恐れが
ある。従って，不意の電源オフの際に，できるだけ上記
更新された音声モデルが保護されるように構成する必要
があるし，もし消失してしまっても，何らかの音声モデ
ルを保持しておく必要がある。このために，上記適応型
音声認識装置Ａ１では，初期の音声モデルは不揮発性で
読出専用のＲＯＭ１０４に記憶され，上記更新による上
記音声モデルの変位分のデータだけが，不揮発性で書き
換え可能なフラッシュメモリ１０５に記憶される。そし
て，上記音声モデル，及び上記音声モデルの変位分のデ
ータを用いる際には，両者が加算された状態で上記プロ
セッサ１０３が用いるワークメモリ１０６に転送され
る。一方，上記ワークメモリ１０６から退避させる際に
は，上記ワークメモリ１０６に保持されている更新済の
音声モデルから，上記初期の音声モデルが差し引かれた
上記変位分のデータが上記フラッシュメモリ１０５に転
送される。尚，上記フラッシュメモリ１０５に転送され
る上記変位分のデータは基本的に上記フラッシュメモリ
１０５の１ブロック内に書き込まれ，消去される際には
ブロック内の全てのデータが一括消去される。When the adaptive speech recognition device A1 is applied to a pet toy A3 or the like as described above, the power may be turned off unexpectedly while the speech model updated by the speech model updating unit 5 after speech input still resides in the volatile work memory 106. The device therefore needs to be configured so that the updated speech model is protected as far as possible against an unexpected power-off, and so that some speech model is retained even if the updated one is lost. To this end, in the adaptive speech recognition device A1 the initial speech model is stored in the non-volatile, read-only ROM 104, and only the displacement of the speech model caused by updating is stored in the non-volatile, rewritable flash memory 105. When the speech model and the displacement data are used, they are added together and transferred to the work memory 106 used by the processor 103. Conversely, when saving from the work memory 106, the displacement data obtained by subtracting the initial speech model from the updated speech model held in the work memory 106 are transferred to the flash memory 105. The displacement data transferred to the flash memory 105 are basically written within one block of the flash memory 105, and when erased, all data in the block are erased at once.
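
The split between the initial model in ROM and the displacement data in flash can be sketched as follows. This is a minimal illustration that treats the model as a flat list of parameters; the function names and sample values are hypothetical, and actual memory access and block erasure are not modelled.

```python
def load_model(rom_model, flash_delta):
    """Build the working model: initial model (ROM) plus displacement (flash)."""
    return [r + d for r, d in zip(rom_model, flash_delta)]


def delta_to_save(work_model, rom_model):
    """Displacement to write to flash: working model minus the initial model."""
    return [w - r for w, r in zip(work_model, rom_model)]


rom_model = [1.0, 2.0, 3.0]      # initial speech model parameters (read-only)
work_model = [1.5, 1.75, 3.0]    # parameters after adaptation, held in work memory

saved = delta_to_save(work_model, rom_model)
restored = load_model(rom_model, saved)
```

If the stored displacement is lost or unusable, loading with an all-zero displacement simply yields the initial model again, which is why some usable model always remains.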

[0023] The processing for recording the speech model and its displacement data at power-off is described below in more detail. FIG. 12 illustrates the case in which the displacement data was transferred normally to one block of the flash memory, and FIG. 13 illustrates the case in which the displacement data was not transferred normally to one block of the flash memory. When the user instructs power-off, for example by turning off the power switch, then as shown in FIG. 12(c) and FIG. 13(c), only the displacement data, obtained by subtracting the values of the initial speech model from the speech model held at that time in the volatile, rewritable work memory 106, is first saved from the work memory 106 to one block of the flash memory 105. When saving of the displacement data is complete, error-check data such as a checksum, used to determine whether the displacement data was saved normally, is written at the end of that block, and only then is the power supply circuit shut down. When the power is turned on again, the processor first uses the error-check data to determine whether the displacement data written in the flash memory is normal. If it is determined to be normal, then as shown in FIG. 12(a) the initial speech model from the ROM 104 and the displacement data from that block of the flash memory 105 are added and then transferred to the work memory 106. Once the displacement data has been transferred to the work memory 106, the block of the flash memory 105 that held the displacement data is block-erased, as shown in FIG. 12(b). If, on the other hand, the data is determined not to be normal, then as shown in FIG. 13(a) only the initial speech model stored in the ROM 104 is transferred to the work memory 106, and the displacement data is set to a predetermined value. At this time, as shown in FIG. 13(b), the block that held the displacement data is block-erased. By storing the initial speech model in the non-volatile, read-only ROM 104 and the displacement data in the non-volatile, rewritable flash memory 105, and transferring them to the work memory 106 when needed, loss of the entire speech model through a power-off mishap can be prevented.
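The single-block save and restore sequence of FIG. 12 and FIG. 13 might look like the sketch below. The source only says "error-check data such as a checksum", so the 16-bit byte sum, the 16-bit signed encoding of the displacement, the function names, and the choice of zero as the "predetermined value" are all assumptions for illustration.

```python
import struct
from typing import List, Sequence, Tuple


def checksum(payload: bytes) -> int:
    """Hypothetical error check: 16-bit sum of the payload bytes."""
    return sum(payload) & 0xFFFF


def save_block(delta: Sequence[int]) -> bytes:
    """Power-off: write the displacement, then the check value at the end."""
    payload = struct.pack(f"<{len(delta)}h", *delta)
    return payload + struct.pack("<H", checksum(payload))


def restore(initial: Sequence[int], block: bytes) -> Tuple[List[int], bool]:
    """Power-on: return (work-memory model, whether the displacement was valid)."""
    payload = block[:-2]
    stored = struct.unpack("<H", block[-2:])[0]
    if len(payload) == 2 * len(initial) and checksum(payload) == stored:
        delta = struct.unpack(f"<{len(initial)}h", payload)
        return [base + d for base, d in zip(initial, delta)], True
    # Invalid data: fall back to the initial model (displacement assumed 0).
    return list(initial), False


initial = [100, 200, 300]
good = save_block([4, -2, 0])
# Simulate a write cut short by power loss: the rest of the block stays erased.
torn = good[:3] + b"\xff" * (len(good) - 3)
```

Writing the check value last is what makes an interrupted save detectable: a block whose tail is still erased fails the comparison on the next power-on.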

[0024] In the example above, the displacement data and the like were stored using only one block of the flash memory 105, but if the flash memory 105 has capacity to spare, the processing can of course use two or more blocks. FIG. 14 illustrates the case in which, with two blocks of the flash memory in use, the latest displacement data was transferred normally, and FIG. 15 illustrates the case in which, with two blocks in use, the latest displacement data was not transferred normally. At power-off, the displacement data for the latest update is stored in whichever of the two blocks of the flash memory 105 is not in use, for example block B1. At this time, error-check data such as a checksum, and write-count data (N-1) giving the number of writes since initialization of the flash memory 105, are stored at a predetermined position in block B1, for example at its end. When the power is next turned on, as shown in FIG. 14(a) and FIG. 15(a), the write-count data (N-1) and (N-2) at the predetermined positions of blocks B1 and B2 of the flash memory 105 are first consulted, and the block with the larger count, that is, the most recently updated block B1, is set as the read target. The error-check data of block B1 is then consulted, and the processor determines whether the displacement data stored in block B1 is normal. If, as shown in FIG. 14(a), the displacement data stored in block B1 is determined to be normal, it is added to the initial speech model from the ROM 104 and then transferred to the work memory 106. Once the displacement data has been transferred to the work memory 106, then as shown in FIG. 14(b) the displacement data stored in block B2, the block with the smaller of the write counts (N-1) and (N-2), which corresponds to the update before last, is erased. When the power is turned off again, the displacement data for the latest update is stored in block B2, which the block erase has put in the erased state (0xFFFF). At this time, error-check data such as a checksum, and the write-count data N since initialization of the flash memory 105, are stored at a predetermined position in block B2, for example at its end.

[0025] If, on the other hand, the displacement data stored in block B1 is determined not to be normal, as shown in FIG. 15(a), block B2, whose write count since initialization is one lower, is set as the read target. Because the displacement data from the update before last stored in block B2 has already been confirmed to be normal, it is added to the initial speech model from the ROM 104 and transferred to the work memory 106. Since the previous data, which was not stored normally in block B1, is no longer needed, block B1 is block-erased and put in the erased state, as shown in FIG. 15(b). When the power is turned off again, the write-count data at the predetermined positions of blocks B1 and B2 of the flash memory 105 is checked, the displacement data is written to block B1, whose value is 0xFFFF (erased state), and the error-check data and the write count N are stored at the end of block B1. When two or more blocks of the flash memory 105 are used to store the displacement data in this way, displacement data one generation older than the latest is used in such a case; however, because the degree of update in the adaptive speech recognition device is not set very high, the effect on speech recognition is small. As a result, even if the power is turned off unexpectedly and the speech model reflecting the latest update in the work memory 106 is lost, there is almost no need to redo the learning.
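The two-block alternation of FIG. 14 and FIG. 15 can be sketched as below. An erased block is modeled as `None`, and the `Block` class, the sum-based check value, and the function names are hypothetical. Unlike the text, which treats the older block as already verified, this sketch re-checks it before use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    delta: List[int]  # displacement data
    count: int        # writes since flash initialization
    check: int        # error-check value written after the data


def check_of(delta: List[int]) -> int:
    """Hypothetical error check: 16-bit sum of the displacement values."""
    return sum(delta) & 0xFFFF


def power_off(blocks: List[Optional[Block]], delta: List[int]) -> None:
    """Write the latest displacement into the erased block (None).

    Raises ValueError if no block is erased, which power_on prevents.
    """
    free = blocks.index(None)
    count = 1 + max((b.count for b in blocks if b is not None), default=0)
    blocks[free] = Block(list(delta), count, check_of(delta))


def power_on(blocks: List[Optional[Block]], initial: List[int]) -> List[int]:
    """Use the newest valid block, erase every other block, return the model."""
    newest_first = sorted((i for i, b in enumerate(blocks) if b is not None),
                          key=lambda i: blocks[i].count, reverse=True)
    chosen = next((i for i in newest_first
                   if check_of(blocks[i].delta) == blocks[i].check), None)
    delta = blocks[chosen].delta if chosen is not None else [0] * len(initial)
    for i in range(len(blocks)):
        if i != chosen:
            blocks[i] = None  # block erase
    return [base + d for base, d in zip(initial, delta)]
```

After every power-on exactly one block holds data and the other is erased, so the next save never overwrites the only good copy.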

[0026] In addition to the power-off behaviour described above, improving the applicability of the adaptive speech recognition device and the speech processing device to the pet toy and the like requires that the speech model be updated in a way that the user notices as little as possible. A modification A3' of the pet toy, equipped with the adaptive speech recognition device A1' and a modification A2' of the speech processing device, is therefore described next. FIG. 16 shows the schematic configuration of the pet toy A3' according to the modification, and FIG. 17 shows the functional blocks of the adaptive speech recognition device A1', the speech processing device A2', and the pet toy A3'. As shown in FIG. 16 and FIG. 17, the adaptive speech recognition device A1' provided in the pet toy A3' and the speech processing device A2' includes, in addition to the configuration of the adaptive speech recognition device A1, an utterance request determination unit 9 and a recognition result history storage unit 10. Serving as the switch (mode selection means) 7 of the adaptive speech recognition device A1, switches 71 and 72 are provided at places the owner tends to touch naturally when showing affection, such as the parts of the pet toy A3' corresponding to the mouth and the forepaws. The control device 300, which includes the adaptive speech recognition device A1', the speech processing device A2', the drive unit 203, the speech synthesis unit 204, and so on, operates on power supplied from the power supply circuit 301 or the battery 302, as shown in FIG. 16. The utterance request determination unit 9 (corresponding to the utterance request determination means) determines a message that prompts the user to utter a specific recognition target word; the message it determines is synthesized by the speech synthesis unit 204 at an appropriate time and output from the speaker 205. Examples of such messages are "Please say konnichiwa" and "Please say oide". When the pet toy A3' speaks such a message, the owner becomes more likely to say "konnichiwa" or "oide" shortly afterwards. Therefore, when the speech detection unit 107 detects speech within a fixed time after the message determined by the utterance request determination unit 9 has been output, the update parameter k determined by the update parameter determination unit 6 is set to a large value for that input speech. This allows the model to be updated effectively and appropriately while avoiding updates of the speech model caused by meaningless speech and the like.

[0027] In this case, the value of the update parameter k may be varied according to the similarity to the input speech, or may be set independently of it, for example to a constant. The value of k may also be reduced according to the time elapsed since the message was output. This is because, as time passes, the recognition target word prompted by the message becomes less likely to be uttered; it is another way of updating the model effectively and appropriately. Determining the update parameter k in this way corresponds to the update parameter k in the learning mode described above. Even messages of this kind, however, can seem conspicuously unnatural depending on when they are output. The switch 7 (switches 71, 72) that triggers the message, that is, that switches the device to learning mode, is therefore placed where the owner naturally touches the toy when showing affection to a pet: parts touched when stroking, such as the head or behind the ears; parts touched when feeding, such as inside the mouth; and parts touched when asking for a paw. Which recognition target word the message prompts is also important. The utterance request determination unit 9 could of course select the target word at random, but if it keeps prompting the same kind of word, the learning effect per speech input diminishes once adaptation has progressed to some extent. For example, the update parameter k is often determined according to the similarity even in learning mode. This is because unwanted speech may be input even after a prompting message has been output, and the resulting adverse effect on the speech model must be avoided. Consequently, when the initial speech model for some recognition target word does not suit the owner, the similarity is low and the update parameter k is set to a small value. In such a case, however, it is better, on the contrary, to update the speech model aggressively.
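The time-dependent choice of the update parameter k after a prompting message (a large value inside a fixed window, shrinking as time passes) can be illustrated as follows. The linear decay, the function name, and every numeric default are assumptions; the source states only that k is set large within a fixed time and may be reduced with elapsed time.

```python
def prompted_update_parameter(elapsed_s: float,
                              k_prompted: float = 0.5,
                              k_normal: float = 0.1,
                              window_s: float = 4.0) -> float:
    """Return k for speech detected elapsed_s seconds after the prompt.

    Inside the window k falls linearly from k_prompted toward k_normal
    (hypothetical decay shape); outside it the normal value applies.
    """
    if elapsed_s < 0.0 or elapsed_s >= window_s:
        return k_normal
    remaining = 1.0 - elapsed_s / window_s
    return k_normal + (k_prompted - k_normal) * remaining
```

A similarity-dependent factor could be multiplied in as well, since the text allows k to follow the similarity even in learning mode.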

[0028] For this purpose, the adaptive speech recognition device A1' includes the recognition result history storage unit 10, which holds the past history of similarities from the evaluation value calculation unit 3 and of recognition target words from the result determination unit 4. The utterance request determination unit 9 then selects, from the past recognition target words stored in the recognition result history storage unit 10, a word whose similarity is low, and determines the corresponding message. The recognition result history storage unit 10 may also store the number of times each recognition target word has been selected in past determinations, and the utterance request determination unit 9 may give priority to messages for words with a low selection count. In doing so, the update parameter k may be set larger the lower the similarity or the selection count, to promote learning of that word. This enables appropriate and effective learning without bias across multiple recognition target words. With a pet toy such as A3', moreover, the toy may need not only to recognize and respond when the owner utters a pre-registered word (a word corresponding to a speech model belonging to the first model), but also to respond to words the owner has chosen (words corresponding to speech models belonging to the second model). For example, the name an owner gives a pet may be unusual or entirely invented, yet the need to respond when the owner says that name to the pet toy is extremely high. For words the owner has registered, however, the recognition rate is extremely low unless learning is advanced. For such owner-registered words, the update parameter k therefore needs to be set larger than for pre-registered words. It is also preferable to increase the frequency with which the utterance request determination unit 9 chooses messages for owner-registered words. As a way of avoiding model updates caused by unwanted utterances or false speech detections, it is also suitable to operate the adaptive speech recognition device, speech processing device, or pet toy by combinations of several words. That is, the update parameter k is set large when the similarities for a plurality of words exceed a fixed threshold. Then, even if an input signal that happens to have high similarity to a registered word is detected, k is set small if the similarity of the other word input with it is low. Alternatively, two words, for example the name used to call the pet toy (owner-registered) and an initially registered word such as a command, may be combined as in "Shiro, oide", and the model updated only when the initially registered word is recognized within a fixed time of the owner-registered word.
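The word-combination gate at the end of paragraph [0028] (update only when an initially registered word follows the owner-registered name within a fixed time) can be sketched as follows. The event format, the function name, the two-second window, and the example vocabularies are assumptions for illustration.

```python
from typing import FrozenSet, List, Tuple


def should_update(events: List[Tuple[float, str]],
                  owner_words: FrozenSet[str] = frozenset({"shiro"}),
                  preset_words: FrozenSet[str] = frozenset({"oide"}),
                  window_s: float = 2.0) -> bool:
    """events: (time in seconds, recognized word), in order of recognition.

    True only if a pre-registered word directly follows an owner-registered
    word within window_s seconds.
    """
    for (t1, w1), (t2, w2) in zip(events, events[1:]):
        if w1 in owner_words and w2 in preset_words and 0.0 <= t2 - t1 <= window_s:
            return True
    return False
```

A lone word, or a pair spoken too far apart, leaves the model untouched, which is how a chance match against a single registered word is kept from driving an update.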

[0029] An example of the processing procedure in the adaptive speech recognition device A1' described above is explained below with reference to FIG. 18 and FIG. 19. FIG. 18 is a flowchart explaining how the utterance request is determined in the adaptive speech recognition device A1', and FIG. 19 is a flowchart explaining how the model update is determined in the adaptive speech recognition device A2'. As shown in FIG. 18, the utterance request determination unit 9 determines an utterance request either when one of the switches 71, 72 is pressed or when speech recognition has detected a single word and its similarity has been calculated. Which of the two triggered the determination is first identified (S801). If step S801 finds that a switch press was the trigger, the utterance request determination unit 9 consults the recognition result history storage unit 10 and calculates a learning progress degree Ki from each recognition target word's past selection count and similarity (S802). Ki is a positive value that becomes smaller as learning of word i progresses. Next, the request frequency for each pre-registered recognition target word is set from Ki (S803). The request frequency is the product of a predetermined parameter F1 and Ki; the parameter may be a constant, or, more generally, the frequency may be a function that becomes larger for words whose learning is progressing more slowly. The request frequency for each recognition target word registered by the owner is then set from Ki and a parameter F2 whose value differs from F1 (S804). By adjusting F1 and F2 appropriately, owner-registered words can be requested more often than pre-registered ones. Finally, a recognition target word is selected according to the frequencies so determined, using random numbers or the like, and a message prompting utterance of the selected word is determined (S805). A check may be added so that no utterance is requested when learning has progressed to a certain extent for all words. When, on the other hand, the utterance request determination unit 9 operates because a word has been recognized, a value K representing the degree of learning progress, analogous to Ki, is obtained (S806) and compared with a threshold K0 (S807). If K is larger than K0, that is, if learning of that word has not progressed, that word is selected as the word to prompt (S808). If K is smaller than K0, learning of that word has progressed to some extent, so it is decided to make no request (S809). For simplicity, the distance between the word model and the features of the detected speech, which is inversely related to the similarity, may be used directly as K.
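The flow of FIG. 18 (steps S801 to S809) can be sketched as below. The learning progress values are taken as precomputed inputs because the text gives no formula for Ki; the function name, the trigger strings, and the defaults for F1, F2, and K0 are likewise assumptions.

```python
import random
from typing import Dict, Optional, Set


def decide_request(trigger: str,
                   progress: Dict[str, float],
                   owner_words: Set[str],
                   recognized: Optional[str] = None,
                   f1: float = 1.0, f2: float = 2.0, k0: float = 0.5,
                   rng=random) -> Optional[str]:
    """Return the word to prompt, or None for "no request".

    progress maps each word i to K_i (larger means less learned).
    """
    if trigger == "switch":                                   # S801
        words = list(progress)                                # S802 (K_i given)
        weights = [(f2 if w in owner_words else f1) * progress[w]
                   for w in words]                            # S803, S804
        if sum(weights) <= 0.0:
            return None                                       # nothing left to learn
        return rng.choices(words, weights=weights, k=1)[0]    # S805
    k = progress[recognized]                                  # S806
    # S807 to S809; K equal to K0 is not covered by the text, treated as no request.
    return recognized if k > k0 else None
```

Choosing F2 larger than F1 makes owner-registered words more likely to be drawn for the same Ki, which matches the stated aim of requesting them more often.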

[0030] Next, an example of the processing by which the update parameter determination unit 6 decides whether to update the model is described with reference to FIG. 19. As shown in FIG. 19, the update parameter determination unit 6 decides whether to perform a model update and determines the value of the update parameter k. To do so, it first calculates the learning progress degree K (S901), and then determines whether it was triggered by speech uttered in response to an utterance request message or by an ordinary speech recognition result, that is, whether the device is in learning mode or normal mode (S902). If step S902 finds learning mode, it is determined whether the recognition target word that was prompted is pre-registered or owner-registered (S903). Depending on which it is, the similarity threshold is set to T1×K or T2×K and the update parameter k to S1×K or S2×K, respectively (S904, S905). If step S902 finds normal mode, it is determined whether the recognized word is pre-registered or owner-registered (S906). Depending on which it is, the similarity threshold is set to T3×K or T4×K and the update parameter k to S3×K or S4×K, respectively (S907, S908). The parameters T1 to T4 and S1 to S4 are for adjusting the similarity and the update parameter, respectively. The similarity threshold calculated in step S904, S905, S907, or S908 is then compared with the similarity for the recognition target word being updated (S909). If that word's similarity is larger than the threshold, the update parameter k calculated in step S904, S905, S907, or S908 is adopted (S910). If it is smaller than the threshold, it is decided not to update the model, that is, k is set to 0 (S911). Thus, in the adaptive speech recognition device, speech processing device, and pet toy according to this embodiment, the adverse effect of meaningless or erroneous utterances on updates of the speech model is suppressed; moreover, because the value of the speech parameter is determined according to the similarity, the progress of learning, and so on, learning can be kept from becoming biased toward particular recognition target words and can be carried out quickly and appropriately when needed. Even if the power is turned off unexpectedly, loss of the whole speech model is prevented; for example, when two or more flash memory blocks are used, in most cases the data from the update before last, at worst, can still be used, so there is almost no need to redo learning even when the updated speech model is lost. Furthermore, by adjusting the speech parameter, the pet toy can be made to imitate behaviour such as a pet growing accustomed to its owner or responding eagerly only to its owner's voice.
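The decision of FIG. 19 (steps S902 to S911) reduces to selecting one of four (T, S) pairs and comparing the similarity with T×K. In the sketch below the numeric values of T1 to T4 and S1 to S4 are invented for illustration, and the learning progress degree K of step S901 is passed in, because the text does not say how it is computed or scaled.

```python
from typing import Sequence


def decide_update(similarity: float,
                  progress_k: float,
                  learning_mode: bool,
                  owner_registered: bool,
                  t: Sequence[float] = (0.5, 0.4, 0.8, 0.7),   # T1..T4, hypothetical
                  s: Sequence[float] = (0.2, 0.4, 0.05, 0.1)   # S1..S4, hypothetical
                  ) -> float:
    """Return the update parameter k, or 0.0 when no update is made."""
    # S902, S903, S906: index 0/1 in learning mode, 2/3 in normal mode.
    idx = (0 if learning_mode else 2) + (1 if owner_registered else 0)
    threshold = t[idx] * progress_k          # S904, S905, S907, S908
    k = s[idx] * progress_k
    # S909 to S911; equality is treated as "update", an assumption here.
    return k if similarity >= threshold else 0.0
```

With these example values a prompted, owner-registered word gets the lowest threshold and the largest k, reflecting the preference for learning such words quickly.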

【００３１】[0031]

[Examples] In the above embodiment, the movement vector Δij(n) is multiplied by the update parameter k in equation (2), but this is not limiting: k may instead be added to the movement vector Δij(n), or both multiplied and added. Also, the update of the speech model assumes that all of the mixture distributions in the speech model are moved, but this is not limiting either; for example, the vector distance between νi(n) and μij(n) may be computed for each mixture distribution (each j), and only the one with the smallest distance moved. In the above embodiment the adaptive speech recognition device and speech processing device according to the present invention are applied to a pet toy, but they can also be applied to other devices such as car navigation systems and portable music players. As for the ROM 104, flash memory 105, and work memory 106 of the above embodiment, external devices make it easier to secure capacity, but some or all of them may instead be built into the processor 103.

[0032] In the above embodiment, the model is updated only when the similarity is at or above the threshold, and the update parameter is set so that the adaptation speed of the model update is faster the lower the degree of learning progress; such fixed rules alone, however, do not always work well. For example, for a word whose learning has progressed little, the threshold for deciding whether to update the model is set low and the adaptation speed is also fast, so if the model is updated many times with nothing but wrong words, it may end up trained on a different word rather than the correct one. To avoid this, a functional relationship may be set up between the degree of learning progress and the adaptation speed, for example as in FIG. 20. The two then need not be related monotonically: as in FIG. 20, adaptation can, for instance, be made to proceed at a rather cautious speed for words whose learning has hardly progressed at all. Furthermore, with such a fixed functional relationship, a user whose voice quality happens never to yield high similarity to the initial model values could find that learning through model updates never progresses; the threshold and adaptation speed parameters may therefore be determined by adding a random parameter, such as a random number, to the fixed rule. Then, whatever the user's voice quality, the random numbers will sometimes let a model update go ahead, so the fatal outcome of never adapting to the user at all can be avoided.
There are various variants of the parameter determination rules.
The present invention includes these.
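The two ideas in this paragraph, a non-monotonic relation between learning progress and adaptation speed and a random perturbation of the threshold, can be illustrated with a short sketch. The curve of FIG. 20 is not reproduced here, so the piecewise-linear shape, the breakpoint at 0.2, the base threshold, and the jitter width are all hypothetical values chosen only for illustration, as are the function names.

```python
import random

def adaptation_speed(progress):
    """Map learning progress in [0, 1] to an update parameter k.

    Hypothetical non-monotonic curve: cautious for barely learned words,
    fastest at moderate progress, slower again as learning matures.
    """
    p = min(max(progress, 0.0), 1.0)
    if p < 0.2:
        return 0.1 + 2.0 * p          # rises from 0.1 toward 0.5
    return 0.5 - 0.5 * (p - 0.2)      # falls from 0.5 back toward 0.1

def decide_update(similarity, progress, base_threshold=0.6, jitter=0.1,
                  rng=random):
    """Return the update parameter k, or 0.0 when no update is made.

    A random offset in [-jitter, +jitter] is added to the threshold, so a
    speaker who never quite reaches the fixed threshold is still accepted
    some of the time.
    """
    threshold = base_threshold + rng.uniform(-jitter, jitter)
    if similarity < threshold:
        return 0.0
    return adaptation_speed(progress)
```

With `jitter=0.0` the rule is the fixed one and a similarity of 0.55 never triggers an update; with the default jitter the same similarity is accepted on roughly a quarter of the attempts under these assumed values.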

【００３３】[0033]

[Effects of the Invention] As described above, according to the adaptive speech recognition device of claim 1, the degree of updating of the speech model is controlled by the update parameter when the speech model is updated based on the feature quantity of the input speech, so the adverse effect of meaningless or erroneous utterances on the updated speech model can be suppressed. According to the adaptive speech recognition device of claim 2, the update parameter is determined based on the similarity; for example, determining the update parameter so that the degree of updating increases as the similarity increases suppresses this adverse effect on the updated speech model. According to the adaptive speech recognition device of claim 3, a message prompting utterance of a specific recognition target word is determined, and when input speech is detected within a predetermined time after the message is output, the update parameter corresponding to that input speech is determined so that the degree of updating of the speech model increases; this lowers the likelihood that an update is performed for a meaningless or erroneous utterance. According to the adaptive speech recognition device of claim 4 or 5, the determination of the update parameter differs between the normal mode and the learning mode, in which the user is notified of the utterance content in advance. The learning mode gives priority to the degree of updating, while the normal mode reduces the burden on the user while suppressing the adverse effect of meaningless utterances and the like, so learning is efficient and effective. According to the adaptive speech recognition device of claim 6, among the recognition target words already selected and determined, the update parameter for a word with low similarity is determined so that the degree of updating of the speech model increases, which reduces the imbalance in how far the individual recognition target words have adapted. According to the adaptive speech recognition device of claim 7, the message prompting utterance of a recognition target word is determined preferentially according to the similarity among the recognition target words already selected and determined, which likewise reduces this imbalance. According to the adaptive speech recognition device of claim 8, speaker adaptation can be applied preferentially to speech models uniquely registered by the user. According to the adaptive speech recognition device of claim 9, messages prompting utterance of the recognition target words that correspond to speech models uniquely registered by the user are issued preferentially, so that speaker adaptation is, as a result, applied preferentially to those models. According to the adaptive speech recognition device of claim 10, whether the input speech is suitable for updating is estimated from its length, so the adverse effect of meaningless or erroneous utterances on the updated speech model can be suppressed.

【００３４】[0034]

According to the adaptive speech recognition device of claim 11 or 12, the speech model is updated when the similarity corresponding to consecutive input speech is equal to or greater than a predetermined threshold, which lowers the likelihood that the speech model is updated for a meaningless or erroneous utterance. According to the adaptive speech recognition device of any one of claims 13 to 16, at least the initial speech model is stored in the non-volatile, read-only first storage means, so complete loss of the speech model is prevented even when the power is unexpectedly turned off. According to the adaptive speech recognition device of claim 17, the updated part of the speech model, error check data, and the number of writes since initialization of the flash memory are stored in a block in the erased state among two or more blocks. Starting from the block with the largest number of writes since initialization, the error check data is used to determine whether the updated part of the speech model is normal. Among the updated parts determined to be normal, the one in the block with the largest number of writes since initialization is transferred to the volatile storage means, and an updated part determined not to be normal is erased from its block. Therefore, even if the power is unexpectedly turned off and the most recently updated speech model held in the volatile storage means is lost, the previous update state, or the one before that, can be transferred from the flash memory to the volatile storage means, and there is almost no need to repeat the learning of the speech model. According to the speech processing device of any one of claims 18 to 21, a response is made to the user in correspondence with the recognition target word recognized on the basis of the suitably updated speech model, so an appropriate response can be selected. According to the speech processing device of claim 22 or 23, the response is varied each time input speech occurs, so variety in the responses is ensured. According to the pet toy of claim 24, adjusting the degree of updating of the speech model with the update parameter imitates the way a pet gradually becomes familiar with its owner, which makes it easier for the user to become attached to the pet toy and allows the toy to respond actively only to its user.
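The storage scheme summarized above (a read-only initial model, an update part kept in flash blocks with error check data and a write count, and recovery from the newest valid block) can be sketched in a few lines. This is an illustrative model only: flash blocks are represented as Python dictionaries, `None` stands for an erased block, CRC-32 is an assumed choice of error check data, and erasing the oldest block when no erased block is left is an assumed policy. The names `save` and `restore` are hypothetical.

```python
import struct
import zlib

def _crc(delta, count):
    """Error check data over the write count and the update part (assumed CRC-32)."""
    payload = struct.pack(f"<I{len(delta)}d", count, *delta)
    return zlib.crc32(payload)

def save(blocks, rom_model, ram_model):
    """Store the update part (RAM model minus ROM model) in an erased block."""
    count = 1 + max((b["count"] for b in blocks if b is not None), default=0)
    delta = [r - m for r, m in zip(ram_model, rom_model)]
    if None not in blocks:
        # No erased block left: erase the one with the smallest write count.
        oldest = min(range(len(blocks)), key=lambda i: blocks[i]["count"])
        blocks[oldest] = None
    blocks[blocks.index(None)] = {
        "delta": delta, "count": count, "crc": _crc(delta, count)}

def restore(blocks, rom_model):
    """Rebuild the working model from ROM plus the newest valid update part."""
    order = sorted((i for i, b in enumerate(blocks) if b is not None),
                   key=lambda i: blocks[i]["count"], reverse=True)
    for i in order:
        b = blocks[i]
        if _crc(b["delta"], b["count"]) == b["crc"]:
            return [m + d for m, d in zip(rom_model, b["delta"])]
        blocks[i] = None      # erase a block whose update part is not normal
    return list(rom_model)    # nothing valid: fall back to the initial model
```

Because the newest block is written only after an older one has been erased, a power failure during a write leaves the previous update part intact, and `restore` falls back to it when the check on the newest block fails.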

[Brief description of the drawings]

FIG. 1 is a functional block diagram showing the schematic configuration of an adaptive speech recognition device A1 according to an embodiment of the present invention.

FIG. 2 is a diagram showing an example of the hardware configuration required for the adaptive speech recognition device A1.

FIG. 3 is a diagram for explaining the process of finding the corresponding states by backtracking.

FIG. 4 is a diagram for explaining the movement of a mixture distribution.

FIG. 5 is a diagram for explaining an example of determining the update parameter.

FIG. 6 is a diagram for explaining a case where the determined value of the update parameter is varied between the normal mode and the learning mode.

FIG. 7 is a diagram for explaining adjustment of the degree of updating of the speech model according to the length of the input speech.

FIG. 8 is a diagram showing the schematic configuration of a speech processing device A2 and a pet toy A3 according to an embodiment of the present invention.

FIG. 9 is a diagram for explaining selection of a response level according to the similarity when a plurality of thresholds are used.

FIG. 10 is a diagram for explaining selection of a response level according to the similarity when only one threshold is used.

FIG. 11 is a diagram for explaining a case where a plurality of responses are set for one response level.

FIG. 12 is a diagram for explaining the state in which the displacement data has been transferred normally to one block of the flash memory.

FIG. 13 is a diagram for explaining the state in which the displacement data has not been transferred normally to one block of the flash memory.

FIG. 14 is a diagram for explaining the state in which the latest displacement data has been transferred normally when two blocks of the flash memory are used.

FIG. 15 is a diagram for explaining the state in which the latest displacement data has not been transferred normally when two blocks of the flash memory are used.

FIG. 16 is a diagram showing the schematic configuration of a pet toy A3' according to an embodiment of the present invention.

FIG. 17 is a functional block diagram of an adaptive speech recognition device A1', a speech processing device A2', and the pet toy A3' according to an embodiment of the present invention.

FIG. 18 is a flowchart illustrating an example of the procedure for determining an utterance request in the adaptive speech recognition device A1'.

FIG. 19 is a flowchart illustrating an example of the model update procedure in the adaptive speech recognition device A1'.

FIG. 20 is a diagram showing an example of the functional relationship between the degree of learning progress and the adaptation speed.

[Explanation of symbols]

1 … Feature extraction unit
2 … Speech model storage unit
3 … Evaluation value calculation unit
4 … Result determination unit
5 … Speech model update unit
6 … Update parameter determination unit
7 … Switch
8 … Response control unit
9 … Utterance request unit
10 … Recognition result history storage unit
71, 72 … Switches
103 … Processor
104 … ROM
105 … Flash memory
106 … Work memory
201 … Movable part
202 … Motor
203 … Drive unit
204 … Speech synthesis unit

Continued from the front page

(51) Int.Cl.⁷, identification symbol, FI, theme code (reference): G10L 3/00 551H 571T

(72) Inventor: Hiroshi Hashimoto, c/o Kobe Steel, Ltd., Kobe Research Institute, 1-5-5 Takatsukadai, Nishi-ku, Kobe-shi, Hyogo

(72) Inventor: Yoshiro Nishimoto, c/o Kobe Steel, Ltd., Kobe Research Institute, 1-5-5 Takatsukadai, Nishi-ku, Kobe-shi, Hyogo

F-terms (reference): 2C150 BA11 CA02 DF02 DF04 DF31 ED56 EF30; 5D015 GG01 GG04 GG06 KK01 KK02 KK04 LL10

Claims

[Claims]

1. An adaptive speech recognition device comprising: feature extraction means for extracting a feature quantity of input speech; speech model storage means for storing speech models corresponding to recognition target words; similarity calculation means for calculating the similarity between the feature quantity extracted by the feature extraction means and the speech models stored in the speech model storage means; recognition target word selection determining means for selecting and determining the recognition target word corresponding to the input speech based on the similarity calculated by the similarity calculation means; speech model updating means for updating the speech models stored in the speech model storage means based on the feature quantity extracted by the feature extraction means; and update parameter determining means for determining an update parameter that controls the degree to which the speech model updating means updates the speech model.

2. The adaptive speech recognition device according to claim 1, wherein the update parameter determining means determines the update parameter based on the similarity calculated by the similarity calculation means.

3. The adaptive speech recognition device according to claim 1, further comprising utterance request determining means for determining and outputting a message prompting utterance of a specific recognition target word, wherein, when the input speech is detected within a predetermined time after the message determined by the utterance request determining means is output, the update parameter determining means determines the update parameter corresponding to that input speech so that the degree of updating of the speech model increases.

4. The adaptive speech recognition device according to any one of the preceding claims, further comprising mode selecting means for selecting either a normal mode or a learning mode in which the user is notified of the utterance content in advance, wherein the update parameter determining means determines the update parameter according to the mode selected by the mode selecting means.

5. The adaptive speech recognition device according to claim 4, wherein the update parameter is determined so that the degree of updating of the speech model is higher when the learning mode is selected by the mode selecting means than when the normal mode is selected by the mode selecting means.

6. The adaptive speech recognition device according to any one of the preceding claims, further comprising recognition result history storage means for storing a history of the recognition target words already selected and determined by the recognition target word selection determining means together with their similarities, wherein the update parameter determining means determines the update parameter corresponding to a recognition target word with low similarity, among the recognition target words stored in the recognition result history storage means, so that the degree of updating of the speech model increases.

7. The adaptive speech recognition device according to claim 3, further comprising recognition result history storage means for storing a history of the recognition target words already selected and determined by the recognition target word selection determining means together with their similarities, wherein the utterance content requesting means preferentially determines, according to the similarity, a message prompting utterance of a recognition target word from among the recognition target words stored in the recognition result history storage means.

8. The adaptive speech recognition device according to claim 1, wherein, when the speech models include a first model registered in advance and a second model uniquely registered by the user, the update parameter determining means determines the update parameter corresponding to a speech model belonging to the second model so that its degree of updating is higher than that of a speech model belonging to the first model.

9. The adaptive speech recognition device according to claim 3 or 7, wherein, when the speech models include a first model registered in advance and a second model uniquely registered by the user, the utterance content requesting means issues a message prompting utterance of a recognition target word corresponding to a speech model belonging to the second model more frequently than for a speech model belonging to the first model.

10. The adaptive speech recognition device according to claim 1, wherein the update parameter determining means weights the update parameter based on the length of the input speech.

11. The adaptive speech recognition device according to any one of claims 1 to 10, wherein, when the similarity corresponding to consecutive input speech is equal to or greater than a predetermined threshold, the update parameter determining means sets the update parameter to a larger value than otherwise.

12. The adaptive speech recognition device according to claim 11, wherein the consecutive input speech is a combination of a recognition target word corresponding to a first model registered in advance among the speech models and a recognition target word corresponding to a second model uniquely registered by the user.

13. The adaptive speech recognition device according to any one of the preceding claims, wherein the speech model storage means includes a non-volatile, read-only first storage means and a non-volatile, rewritable second storage means, the speech model is stored in the first storage means, and the updated part of the speech model updated by the speech model updating means is stored in the second storage means.

14. The adaptive speech recognition device according to claim 13, wherein, when the power is turned off while the speech model updated by the speech model updating means is held in a volatile storage means, the updated part of the speech model is generated by subtracting the speech model stored in the first storage means from the updated speech model held in the volatile storage means, and is stored in the second storage means.

15. The adaptive speech recognition device according to claim 14, wherein, when the power is turned on, the speech model stored in the first storage means and the updated part of the speech model stored in the second storage means are added together and transferred to the volatile storage means.

16. The adaptive speech recognition device according to claim 15, wherein the first storage means is a ROM and the second storage means is a flash memory that is erased block by block.

17. The adaptive speech recognition device according to claim 16, wherein, when two or more blocks of the flash memory are used, the updated part of the speech model, error check data, and the number of writes since initialization of the flash memory are stored in a block in the erased state among the two or more blocks; starting from the block with the largest number of writes since initialization, the error check data is used to determine whether the updated part of the speech model is normal; among the updated parts determined to be normal, the updated part in the block with the largest number of writes since initialization is transferred to the volatile storage means; and an updated part determined not to be normal is erased from the block that holds it.

18. A speech processing device comprising: the adaptive speech recognition device according to claim 1; and response control means for selecting response content stored in advance in correspondence with the recognition target word selected and determined by the recognition target word selection determining means, and for controlling a response to the input speech.

19. The speech processing device according to claim 18, wherein the response content is stored divided into a plurality of response levels, the device further comprising response level determining means for determining the response level based on the similarity calculated by the similarity calculation means.

20. The speech processing device according to claim 19, wherein the response level determining means sets a similarity threshold for each response level and determines the response level by comparing the similarity with the plurality of thresholds.

21. The speech processing device according to claim 19, wherein the response level determining means performs arithmetic processing on the similarity using a different coefficient for each response level and determines the response level by comparing the processed similarity with a predetermined threshold.

22. The speech processing device according to claim 20, wherein some or all of the plurality of thresholds are changed each time input speech occurs.

23. The speech processing device according to claim 21, wherein some or all of the coefficients are changed each time input speech occurs.

24. A pet toy comprising: the speech processing device according to claim 18; speech synthesizing means; and driving means for driving a movable part, wherein the response control means responds to the input speech by controlling the speech synthesizing means and the driving means.