JP3398401B2

JP3398401B2 - Voice recognition method and voice interaction device

Info

Publication number: JP3398401B2
Application number: JP21176892A
Authority: JP
Inventors: デイビットグリーブス; 仁史永田; 洋一竹林; 重宣瀬戸; 泰樹山下
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1992-03-16
Filing date: 1992-08-07
Publication date: 2003-04-21
Anticipated expiration: 2018-04-21
Also published as: JPH05323993A

Abstract

PURPOSE: To capture and recognize speech input from a talker even while the system is issuing an audio response, by cancelling the audio response that is output from the speaker and picked up by the microphone. CONSTITUTION: An audio signal input from the microphone 1 is first supplied to a speech recognition part 5 as it is. An interactive control part 6 then selects the audio response for the recognized speech; the audio response is output from an audio response part 7, supplied to an adaptive filter 3, and also output from the speaker 8. Meanwhile, the microphone 1 picks up input speech on which the audio response from the speaker 8 is superimposed, and this signal is supplied to a subtractor 4. The speaker 8 output picked up by the microphone 1 is cancelled by subtracting, from the signal input from the microphone 1, the audio response corrected by using the LMS/Newton algorithm.

Description

Detailed Description of the Invention

[0001] [Industrial Field of Application] The present invention relates to a voice dialogue system in which a human and a computer converse by voice.

[0002] [Description of the Related Art] In recent years, voice dialogue systems that use voice information have been actively developed as an interface between humans and computers.

[0003] A voice dialogue system is effective as a multimedia dialogue system that displays visual data such as graphics, images, and animations together with voice output. When a talker speaks into a microphone, the system recognizes the speech and outputs a voice response from a speaker, thereby conducting a dialogue with the person. Consider such a system used in a hamburger shop. First, when a customer says "two hamburgers and three juices" into the microphone, the system recognizes this and outputs a confirming utterance, "Two hamburgers and three juices, is that right?" If the customer then answers "yes", the order is confirmed as two hamburgers and three juices and the staff are notified.

[0004] However, if the customer mistakenly says "three hamburgers ...", the order cannot be cancelled immediately; the customer must wait for the system's confirmation response, "Three hamburgers ..., is that right?", cancel at that point, and then say "two hamburgers ..." again. Likewise, if a customer says "Two hamburgers, a cola, and an ice cream, please" and the system misrecognizes this and responds "Four potatoes, a cola, and an ice cream, is that right?", the customer wants to interrupt and correct the order as soon as "four potatoes ..." is heard, but cannot do so until the entire system response has finished. The dialogue therefore takes a long time and is very tedious.

[0005] [Problems to Be Solved by the Invention] As described above, a conventional voice dialogue system cannot accept voice input from the talker while it is outputting a voice response; speech must be input after the system's response has finished entirely. Consequently, when the system misrecognizes an utterance, re-entering it takes a long time, and an efficient dialogue is not possible.

[0006] The present invention has been made to solve this conventional problem. Its first object is to provide a voice recognition method and a voice interaction device that can capture and recognize voice input from a talker even while the system is emitting a voice response.

[0007] A second object is to provide a voice interaction device that can change the output of the voice response according to the importance of the recognized content and of the response content.

[0008] [Means for Solving the Problems] To achieve the above objects, the first invention of the present application is a voice recognition method for conducting a dialogue by recognizing speech input from an input means such as a microphone and outputting a predetermined voice response from an output means such as a speaker based on the recognition result, characterized in that: a transfer function, which is the frequency spectrum of the impulse response of the voice response output from the output means, is estimated using the response generation parameters used to synthesize the voice response; the voice response is corrected by the estimated impulse response; only the corrected voice response is cancelled from the input speech; and the speech remaining after the voice response has been cancelled is recognized.

[0009] The second invention of the present application is characterized by comprising: means for obtaining the background noise power when there is no voice input; means for obtaining the synthesized sound power in the microphone signal based on the impulse response during synthesized voice output; means for setting the sum of the background noise power and the synthesized sound power as the detection threshold for voice input power, and determining whether there is voice input based on the duration for which the voice input power exceeds that threshold; and means for performing voice recognition only when the determining means judges that there is voice input.
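The detection rule of the second invention can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name, the frame-based representation, and the `min_frames` duration parameter are all assumptions made for illustration. Only the threshold (background noise power plus synthesized sound power) and the use of a duration test come from the text.

```python
def detect_voice_input(input_powers, noise_power, synth_powers, min_frames=3):
    """Return True if voice input is judged to be present.

    input_powers : per-frame power of the microphone signal
    noise_power  : background noise power measured with no voice input
    synth_powers : per-frame estimate of synthesized-sound power at the microphone
    min_frames   : hypothetical duration requirement, in consecutive frames
    """
    run = 0  # number of consecutive frames above the threshold
    for p_in, p_synth in zip(input_powers, synth_powers):
        threshold = noise_power + p_synth  # threshold rises while the system talks
        if p_in > threshold:
            run += 1
            if run >= min_frames:
                return True
        else:
            run = 0
    return False
```

Because the threshold tracks the synthesized sound power frame by frame, residual system speech that leaks into the microphone does not by itself trigger recognition.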

[0010] Further, the third invention of the present application is characterized by comprising: pattern recognition means for recognizing input from a user via at least one of voice, a keyboard, and a pointing device; dialogue management means for determining the content of a voice response and an image response based on the understanding result from the pattern recognition means; interrupt control means for determining whether to accept an interruption from the user, based on the understanding result from the pattern recognition means and the response content output from the dialogue management means; and response generation and output means that, based on the interrupt control information from the interrupt control means and the response content from the dialogue management means, either terminates the image response or voice response being output, or changes response generation parameters such as the speech rate, prosody, and power of the image response or voice response being output and then outputs that response.

[0011] [Operation] In the first invention configured as described above, the voice response is corrected according to characteristics of the voice response such as power and pitch, and the corrected signal is subtracted from the microphone input. The voice response is thus removed from the user's speech signal on which it is superimposed, and the speech is then recognized. As a result, the user can speak even while a voice response is being output.

[0012] In addition, if a smoothing filter that smooths the voice response signal is provided and, based on its output, adaptation is stopped while no voice response is being output, the transfer function estimation accuracy does not degrade during periods without a voice response, and high estimation accuracy can be maintained.

[0013] In the second invention, the background noise power is obtained in advance, and input speech is recognized when an input larger than this arrives. Even when the voice response is not completely removed and some of the voice response from the speaker is picked up by the microphone, erroneous input is prevented by raising or lowering the threshold used to detect voice input according to the power of the voice response. Highly accurate voice input is therefore possible.

[0014] Further, in the third invention, when the user makes an interrupting input during a voice response, whether to permit the interruption is decided based on the importance of the input content and the importance of the voice response, and the output of the voice response is controlled accordingly. This enables sophisticated dialogue adapted to the content of the input speech and of the voice response.

[0015] [Embodiments] Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 is a block diagram showing a first embodiment of a voice dialogue system to which the present invention is applied.

[0016] As shown in the figure, this voice dialogue system comprises: a microphone 1 that captures input speech from a talker; a speaker 8 that outputs the system's voice response; a voice response removing unit 2 that removes the voice response superimposed on the talker's input speech; a voice recognition unit 5 that takes the output of the voice response removing unit 2 and recognizes the content of the talker's utterance; a dialogue control unit 6 that selects and controls the voice response corresponding to the recognized speech; a voice response unit 7 that actually outputs the voice response to the speaker 8 and to the voice response removing unit 2; and a display 16 that displays visual data such as graphics, images, and animations.

[0017] The voice response removing unit 2 has: a lookup table 3a in which power information, pitch information, amplitude information, and voiced/unvoiced/silence information for the various voice responses are stored in advance; an adaptive filter 3 that obtains an impulse response by the LMS/Newton algorithm described later and uses it to correct and output the voice response; and a subtractor 4 that subtracts the output of the adaptive filter 3 from the input of the microphone 1.

[0018] The operation of this embodiment with the above configuration will now be described with reference to the flowchart shown in FIG. 3.

[0019] First, when the talker inputs speech through the microphone 1, the audio signal is supplied to the voice recognition unit 5 via the voice response removing unit 2. At this point there is no output from the voice response unit 7, so the voice response removing unit 2 performs no processing and the audio signal from the microphone 1 is supplied to the voice recognition unit 5 unchanged. The dialogue control unit 6 then selects a voice response to the recognized speech (step ST1), and the voice response unit 7 outputs it, so that the voice response is supplied to the adaptive filter 3 and also output from the speaker 8 (steps ST2, ST3).

[0020] The adaptive filter 3 then obtains the impulse response using the following equation (1).

[0021] [Equation 1]

$$W_{k+1} = W_k + 2\mu R'_k e_k X_k \quad \cdots (1)$$

Equation (1) is the update formula known as the LMS/Newton algorithm. Here k is the time index: k denotes the current output and k+1 the next output. R' is the inverse of the correlation matrix of the voice response and is supplied from the lookup table 3a.

[0022] μ is the convergence coefficient. The voice response output from the speaker 8 does not reach the microphone 1 unchanged; reflection, attenuation, and the like occur depending on the surrounding environment. μ is a factor used to determine the transfer function W while taking these changes into account. e is the error, and X is the input signal vector.

[0023] The impulse response obtained in this way is multiplied by the voice response X to generate an output signal y, which is output to the subtractor 4 (step ST4).

[0024] That is, $y = W^T X$ (T denotes transpose) ... (2).

[0025] Meanwhile, the microphone 1 picks up the input speech on which the voice response from the speaker 8 is superimposed. The captured audio signal d is supplied to the subtractor 4 (step ST5), and the subtractor 4 obtains the subtraction signal s by the following equation (3) (step ST6).

[0026] $s = d - y$ ... (3)

The subtraction signal s is then supplied to the voice recognition unit 5 (step ST7), where the talker's input speech is recognized; the corresponding voice response is selected by the dialogue control unit 6 and output from the voice response unit 7. The adaptive filter 3 takes in this voice response and obtains the next impulse response (step ST8), and the above operation is repeated until voice input ends (step ST9).
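The loop formed by equations (1) to (3) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the function names are hypothetical, the error e_k is assumed to equal the subtraction signal s (the text does not define e explicitly), and X_k is assumed to hold the most recent samples of the voice response.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def lms_newton_step(w, x, d, r_inv, mu):
    """One update of equations (1)-(3) for a single sample."""
    y = dot(w, x)                        # eq. (2): y = W^T X
    s = d - y                            # eq. (3): s = d - y
    e = s                                # assumption: the error e_k is the residual s
    g = [dot(row, x) for row in r_inv]   # R' X
    w_next = [wi + 2.0 * mu * e * gi for wi, gi in zip(w, g)]  # eq. (1)
    return w_next, s


def cancel_response(response, mic, taps, r_inv, mu):
    """Run the canceller over whole signals; returns (residual, final weights)."""
    w = [0.0] * taps
    residual = []
    for k in range(len(mic)):
        # X_k: current and previous response samples (zeros before the start).
        x = [response[k - i] if k - i >= 0 else 0.0 for i in range(taps)]
        w, s = lms_newton_step(w, x, mic[k], r_inv, mu)
        residual.append(s)
    return residual, w
```

In this sketch `residual` plays the role of the signal passed to the voice recognition unit 5, and `r_inv` stands in for the matrix R' that the text says is supplied from the lookup table 3a.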

[0027] Thus, in this embodiment, the voice response output from the speaker 8 is corrected using the LMS/Newton algorithm and the corrected signal is subtracted from the signal input from the microphone 1, thereby cancelling the speaker 8 output picked up by the microphone 1. Consequently, the talker can input speech through the microphone 1 even while a voice response is being output from the speaker 8.

[0028] In the above embodiment, the algorithm was implemented using the inverse R' of the autocorrelation matrix of the voice response. When the voice response is generated by rule-based synthesis, however, information such as the speech power, voiced/unvoiced, vowel/consonant, silence, and duration may be used instead. In particular, when the LMS/Newton algorithm is implemented using the speech power p, the formula shown in equation (4) below is used.

[0029] [Equation 2]

$$W_{k+1} = W_k + 2\,\frac{\mu}{p_k L}\, e_k X_k \quad \cdots (4)$$

where L is the dimension of the input speech vector. Furthermore, in the voice dialogue system of this embodiment, characteristics of the voice response such as power information and pitch information are stored in advance in the lookup table 3a, so an impulse response well suited to the characteristics of the voice response can be obtained.
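A single step of the power-normalized update in equation (4) might look like the following Python sketch. The function name and the small `eps` guard against division by zero are assumptions added for illustration; as in the earlier equations, the error is assumed to be the residual d − y.

```python
def power_normalized_step(w, x, d, power, mu, eps=1e-12):
    """One update of eq. (4): W <- W + 2 * (mu / (p_k * L)) * e_k * X_k.

    power : speech power p_k of the voice response (e.g. from a lookup table)
    eps   : hypothetical guard so that silence (p_k = 0) does not divide by zero
    """
    length = len(x)                                  # L: dimension of X_k
    y = sum(wi * xi for wi, xi in zip(w, x))         # filter output
    e = d - y                                        # assumed error: residual
    gain = 2.0 * mu / (max(power, eps) * length)
    w_next = [wi + gain * e * xi for wi, xi in zip(w, x)]
    return w_next, e
```

Dividing by p_k L scales the step down when the response is loud and up when it is quiet, which is why supplying the known power of the synthesized response can help the estimate track a signal whose level varies.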

[0030] FIG. 2 is a characteristic diagram showing the power information of the voice response and the removal results of the voice response removing unit 2. Curve S₃ is the power information of the voice response; curve S₁ is the removal result when the algorithm is run with this power information held at a constant value; and curve S₂ is the removal result when the algorithm is run using the data with the power information varying as in curve S₃. As the figure shows, running the algorithm with the power information stored in the lookup table 3a yields a better removal result, and the voice response can be removed with high accuracy.

[0031] This embodiment has described the case where the response emitted from the speaker 8 is speech only. To output music together with speech, the voice response unit 7 shown in FIG. 1 is configured as in FIG. 6: it has a voice synthesis unit 10 that outputs a speech signal, a music synthesis unit 11 that outputs a music signal, and a mixer 9 that combines them. The characteristic information of the music is easily obtained from the musical notes, and if it is stored in the lookup table 3a shown in FIG. 1, the voice response can be removed in the same way as in the speech-only case.

[0032] The invention is applicable not only to speech and music but also to acoustic signals such as natural sounds (birdsong and the like) and buzzer sounds. A buzzer sound is known in advance to be a periodic signal, and random noise is known to be irregular but stationary, so this information can be used to perform highly accurate noise cancellation.

[0033] It is known that when the signal output from the voice response unit is wideband noise (white noise), the transfer function W from the speaker 8 to the microphone 1 is easy to estimate. The voiced sounds (vowels and the like) of a speech signal, on the other hand, are periodic and also non-stationary, so their short-time frequency spectrum is a line spectrum. The spectral components therefore do not cover a wide band, which degrades the estimation accuracy of the impulse response. With the configuration shown in FIG. 6, a wideband signal such as noise or music can be added, in addition to the voice message, in the regions where the voice response has no frequency components, and the accuracy of the LMS and FLMS algorithms can be improved.

[0034] Next, a second embodiment of the present invention will be described. In the first embodiment described above, it is known that the estimation accuracy of the impulse response drops markedly when the user speaks to the voice dialogue system. The second embodiment therefore provides a transfer function update control unit 15, as shown in FIG. 8, to improve the estimation accuracy. Its operation is described below.

[0035] First, while the impulse response is being estimated with the LMS/Newton algorithm, past impulse responses are retained, for example one every 100 ms covering the last 5 seconds.

[0036] That is, the following transfer functions are stored:

$W_0$ ... current
$W_{-1}$ ... 100 ms ago
$W_{-2}$ ... 200 ms ago
...
$W_{-50}$ ... 5 s ago

When the voice recognition unit 5 shown in FIG. 1 detects the user's speech, the impulse response setting is changed to one from before the utterance. For example, if speech from the user began 750 ms ago, the impulse response $W_{-8}$ from 800 ms ago is used in place of $W_0$ for the subsequent processing. This operation is explained with reference to the time chart shown in FIG. 7.
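The snapshot-and-rollback idea of this embodiment can be sketched with a bounded history buffer. The class and method names below are hypothetical, and rounding the speech onset up to the next 100 ms boundary is an assumption chosen so that the 750 ms example in the text selects the snapshot from 800 ms ago.

```python
from collections import deque


class ResponseHistory:
    """Keeps impulse-response snapshots W_0, W_-1, ..., W_-50 (hypothetical helper)."""

    def __init__(self, period_ms=100, span_ms=5000):
        self.period_ms = period_ms
        # 51 entries cover "now" plus 50 past snapshots at 100 ms spacing.
        self.snapshots = deque(maxlen=span_ms // period_ms + 1)

    def push(self, w):
        """Store the latest estimate; index 0 is always the newest (W_0)."""
        self.snapshots.appendleft(list(w))

    def before_speech(self, onset_ms_ago):
        """Return the newest snapshot taken at or before the speech onset."""
        if not self.snapshots:
            raise ValueError("no snapshots stored yet")
        steps = -(-onset_ms_ago // self.period_ms)      # ceiling division
        steps = min(steps, len(self.snapshots) - 1)     # clamp to the oldest entry
        return self.snapshots[steps]
```

With snapshots pushed every 100 ms, `before_speech(750)` returns the entry eight steps back, so an estimate that has not yet been disturbed by the user's voice is restored.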

[0037] Curve S₄ in the figure is the voice response signal, and curve S₅ is the user's speech signal. The voice response removing unit 2 removes the voice response while updating the impulse response every 100 ms, and the voice recognition unit 5 detects the user's utterance together with its start point t_S and end point t_E. When the user's utterance is detected, the impulse response update control unit 15 shown in FIG. 8 decides every 100 ms whether to update the impulse response estimate W₀ or to use a past estimate W_i (i = −1 to −50). This allows the adaptive filter 3 to obtain a more accurate impulse response, which improves the removal efficiency of the voice response.

[0038] In each of the embodiments described above, rule-based speech synthesis is used to generate the voice response. A method for estimating an accurate impulse response from the internal information required for this synthesis (for example, the time series of pitch and power) is described below with reference to FIGS. 5 and 4. FIG. 5 shows the time variation of power and pitch when synthesizing the voice response "torikeshimasu" ("I will cancel that"). FIG. 4 is a flowchart for obtaining the convergence coefficient of FLMS. Note that FLMS estimates the transfer function, which is the frequency spectrum of the impulse response.

[0039] First, at time n = 0, whether the current segment is silent is determined from the power information shown in FIG. 5(a) (step ST11). If it is silent (YES in step ST11), the FLMS convergence coefficient μ(f) is set to 0 at all frequencies (step ST14). The transfer function estimate is then left unchanged by adaptive estimation, so even if noise enters the microphone 1 during the silent segment, the estimate is unaffected.

[0040] If the segment is not silent (NO in step ST11), whether the phoneme is a consonant or a vowel is determined (step ST12). This determination is easy because the current phoneme is known.

[0041] If the phoneme is a consonant ("consonant" branch of step ST12), it is further determined whether its power is at or above a threshold (for example, the ambient noise level + 20 dB) (step ST15). If it is below the threshold (NO in step ST15), μ(f) = 0 is set for all frequencies (step ST16). If it is at or above the threshold, μ(f) = a is set at all frequencies, where a is a predetermined convergence coefficient (step ST17).

[0042] If the phoneme is a vowel ("vowel" branch of step ST12), it is determined whether its power is at or above the threshold (step ST13). If it is below the threshold (NO in step ST13), μ(f) = 0 is set for all frequencies (step ST18).

[0043] If it is at or above the threshold (YES in step ST13), μ(f) = a is set, for example, within ±(1/3)f_p of each integer multiple of the pitch frequency f_p, and μ(f) = 0 is set outside these ranges (step ST19). This is expressed by the following equation (5).

[0044] [Equation 3]

$$\mu(f) = \begin{cases} a, & f_p n - \tfrac{1}{3} f_p < f < f_p n + \tfrac{1}{3} f_p \\ 0, & \text{otherwise} \end{cases} \quad \cdots (5)$$

The above procedure is repeated, for example, every 10 ms (step ST20).

[0045] In this way, the transfer function estimate is updated with emphasis on the frequency components of the voice response signal that have high power, so highly accurate estimation is possible.
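The decision flow of steps ST11 to ST19 can be summarized in a short Python sketch that returns μ(f) for a list of frequency bins. The function name and the string labels for the phoneme class are hypothetical. Two points are assumptions: the integer n in equation (5) is taken to be 1 or greater, and a power exactly equal to the threshold is treated as "at or above".

```python
def convergence_coefficients(freqs_hz, phoneme, power, threshold, pitch_hz, a):
    """Per-frequency FLMS convergence coefficient mu(f) for one 10 ms step.

    phoneme : "silence", "consonant", or "vowel" (known from the synthesizer)
    """
    zeros = [0.0] * len(freqs_hz)
    if phoneme == "silence":          # ST11 -> ST14
        return zeros
    if power < threshold:             # ST15/ST13 "NO" -> ST16/ST18
        return zeros
    if phoneme == "consonant":        # ST17: adapt at all frequencies
        return [a] * len(freqs_hz)
    # Vowel at or above threshold (ST19): adapt only near pitch harmonics, eq. (5).
    mu = []
    for f in freqs_hz:
        n = round(f / pitch_hz)       # nearest harmonic number
        near = n >= 1 and abs(f - n * pitch_hz) < pitch_hz / 3.0
        mu.append(a if near else 0.0)
    return mu
```

For a vowel with a 100 Hz pitch, for example, bins at 100 Hz and 200 Hz receive the coefficient a, while bins at 140 Hz or 250 Hz, which lie between harmonics, receive 0.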

[0046] Next, a third embodiment of the present invention will be described. It is known that, in transfer function estimation by the LMS/Newton algorithm described above, the estimation accuracy fluctuates and the estimation becomes unstable when the input is a non-stationary signal such as speech. A dialogue system, however, requires stable impulse response estimation even when synthesized speech is the input. A method for stably obtaining an accurate impulse response even when the input signal has large power fluctuations is therefore described below.

[0047] FIG. 9 is a block diagram showing the configuration of the third embodiment; it shows the internal configuration of the voice response removing unit 2 shown in FIG. 1. As shown, the voice response removing unit 2 comprises: A/D converters 31 and 32 provided on the synthesized-speech input side (voice response) and on the microphone input side, respectively; a first smoothing filter 33 and a second smoothing filter 34 that smooth the voice response signal power; an adaptation/stop switching unit 35 that decides, based on the output signals of the smoothing filters, whether to perform adaptation; the adaptive filter 3; a convolution unit 36; and a subtractor 4.

[0048] The first smoothing filter 33 has a short time constant; for example, the time constant t₁ is 10 ms.

[0049] The second smoothing filter 34 has a long time constant; for example, the time constant t₂ is 100 ms.

[0050] The adaptation/stop switching unit 35 stops adaptation by the adaptive filter 3 when the output of the first smoothing filter 33 falls to or below a predetermined threshold V_a, and starts adaptation when the output of the second smoothing filter 34 reaches or exceeds a predetermined threshold V_b.

[0051] FIG. 13 shows the power information of the utterance "douzo" ("please go ahead"): FIG. 13(a) is the output of the first smoothing filter 33, and FIG. 13(b) is the output of the second smoothing filter 34. Because of the difference in time constants, the output signal of the second smoothing filter 34 is naturally the smoother of the two.

【００５２】図１４は、「どうぞ」という音声出力中で
音がとぎれた点付近の各フィルタ３３，３４の出力を重
ねた図である。通常、無音部分と音声部分との亘りの部
分のように音声のパワーが大きく変化したときに伝達関
数の推定精度がわずかの時間内、例えば１［ｍｓｅｃ］
の間に急激に低下する。従って、音声のパワーが大きく
変化したときにはす早く適応化を停止することによっ
て、高い推定精度を維持することができる。そこで、図
１４に示す如く、第１の平滑化フィルタ３３の出力Ｐ_a
（ｔ）がしきい値Ｖ_a以下となったときに適応化を停止
し、第２の平滑化フィルタ３４の出力Ｐ_b（ｔ）がしき
い値Ｖ_b以上となったときに適応化を開始すれば、音声
のパワーが大きく変化したときの適応化は行なわれな
い。これによって、高い推定精度を維持することができ
る。FIG. 14 superimposes the outputs of the filters 33 and 34 near a point where the sound breaks off during output of the utterance "douzo" ("please"). Normally, when the power of the speech changes greatly, as at the boundary between a silent portion and a voiced portion, the estimation accuracy of the transfer function drops sharply within a short time, for example within 1 [ms]. High estimation accuracy can therefore be maintained by stopping the adaptation quickly when the power of the speech changes greatly. Accordingly, as shown in FIG. 14, if the adaptation is stopped when the output P_a(t) of the first smoothing filter 33 falls to or below the threshold V_a and started when the output P_b(t) of the second smoothing filter 34 rises to or above the threshold V_b, no adaptation is performed while the power of the speech is changing greatly. This makes it possible to maintain high estimation accuracy.
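The switching rule of FIG. 14 can be illustrated with a minimal Python sketch. It assumes one-pole (exponential) smoothing of the squared samples, which the description does not specify; the function names and the threshold values passed in are hypothetical. Only the two time constants (10 ms and 100 ms) and the comparisons against V_a and V_b are taken from the text.

```python
import math

def smoothing_coeff(time_constant_s, sample_rate_hz):
    """Coefficient of a one-pole low-pass filter (assumed filter type)."""
    return math.exp(-1.0 / (time_constant_s * sample_rate_hz))

def adaptation_flags(samples, sample_rate_hz, v_a, v_b, t1=0.010, t2=0.100):
    """Return one flag per sample: True where adaptation would be allowed."""
    a1 = smoothing_coeff(t1, sample_rate_hz)   # fast filter (filter 33)
    a2 = smoothing_coeff(t2, sample_rate_hz)   # slow filter (filter 34)
    p_a = p_b = 0.0
    flags = []
    for x in samples:
        power = x * x
        p_a = a1 * p_a + (1.0 - a1) * power
        p_b = a2 * p_b + (1.0 - a2) * power
        # Stop as soon as the fast output is at or below V_a;
        # allow adaptation only once the slow output has reached V_b.
        flags.append(p_a > v_a and p_b >= v_b)
    return flags
```

Because the fast filter reacts within a few milliseconds while the slow one lags, adaptation is suspended almost immediately when the power drops and resumes only after the power has been high for some time.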

【００５３】図１０は「いらっしゃいませ」という合成
音声を入力したときのインパルス応答の推定結果を示し
ており、曲線Ｓ１１は、上記した適応化推定停止を行な
った場合、曲線Ｓ１２は行なわない場合の推定結果であ
る。同図から明らかなように、停止を行なうほうが高精
度にインパルス応答を推定できることが理解される。FIG. 10 shows the impulse response estimation results obtained when the synthesized speech "irasshaimase" ("welcome") is input. Curve S11 is the estimation result with the adaptation stop described above, and curve S12 is the result without it. As is clear from the figure, the impulse response can be estimated with higher accuracy when the stop is performed.

【００５４】図１１は、応答除去後の音声の認識結果で
ある。図から明らかなようにインパルス応答精度が高い
程、すなわち合成音除去量が大きい程音声認識率は高く
なり、合成音声除去の効果が理解される。また、認識方
式は、上記キーワードスポッティングと雑音免疫学習の
組み合わせに限る必要はなく、単語音声認識やＨＭＭに
よる連続音声認識方式でも良い。FIG. 11 shows the recognition result of the voice after the response is removed. As is clear from the figure, the higher the impulse response accuracy, that is, the larger the amount of synthesized speech removed, the higher the speech recognition rate, and the effect of the synthesized speech removal can be understood. Further, the recognition method is not limited to the combination of the keyword spotting and the noise immunity learning described above, and may be word speech recognition or continuous speech recognition method by HMM.

【００５５】図１５は第３実施例においてフィルタ更新
の係数であるステップゲインを求める際の動作を示すフ
ローチャートである。FIG. 15 is a flow chart showing the operation for obtaining the step gain which is the coefficient for updating the filter in the third embodiment.

【００５６】まず、時刻ｋ＝０において（ステップＳＴ
３１）、第１の平滑化フィルタ３３の出力パワーｐ
_a（ｋ）がしきい値Ｖ_a（例えばＶ_a＝合成音の平均パ
ワーである−２０ｄＢ）以下であるか否かを判定する
（ステップＳＴ３２）。そして、しきい値Ｖ_a以下であ
ると判定された場合には（ステップＳＴ３２でＹＥ
Ｓ）、ＬＭＳのμを０として（ステップＳＴ３６）伝達
関数の更新を行なわないようにする。これは、前記した
（４）式から容易に理解され、集束係数μ＝０の際には
Ｗ_kは更新されない。First, at time k = 0 (step ST31), it is determined whether or not the output power p_a(k) of the first smoothing filter 33 is equal to or less than the threshold V_a (for example, V_a = −20 dB, which is the average power of the synthesized sound) (step ST32). If it is determined to be equal to or less than the threshold V_a (YES in step ST32), μ of the LMS is set to 0 (step ST36) so that the transfer function is not updated. This follows readily from equation (4) above: W_k is not updated when the convergence coefficient μ = 0.

【００５７】一方、パワーｐ_a（ｋ）がしきい値Ｖ_a以
上であると判定されると（ステップＳＴ３２でＮＯ）、
次に第２の平滑化フィルタ３４の出力パワーｐ_b（ｋ）
がしきい値Ｖ_b以下であるか否かを判定する（ステップ
ＳＴ３３）。そして、しきい値Ｖ_b以下であると判定さ
れた場合には（ステップＳＴ３３でＹＥＳ）、集束係数
μ＝０（ステップＳＴ３７）として伝達関数の更新を行
なわない。すなわち、図１４における「停止」の部分を
示している。On the other hand, when the power p_a(k) is determined to be equal to or greater than the threshold V_a (NO in step ST32), it is next determined whether or not the output power p_b(k) of the second smoothing filter 34 is equal to or less than the threshold V_b (step ST33). If it is determined to be equal to or less than the threshold V_b (YES in step ST33), the convergence coefficient μ is set to 0 (step ST37) and the transfer function is not updated. This corresponds to the "stop" portion in FIG. 14.

【００５８】そして、パワーｐ_b（ｋ）がしきい値Ｖ_b
以上となると（ステップＳＴ３３でＮＯ）、ステップゲ
インを以下の（５）式で求める。Then, when the power p_b(k) becomes equal to or greater than the threshold V_b (NO in step ST33), the step gain is obtained by the following equation (5).

【００５９】[0059]

【数４】ステップゲイン＝２μ・ｅ（ｋ）／｛ｐ_b（ｔ）・Ｌ｝ …（５）こうして、伝達関数の更新が行なわれるのである。[Equation 4] $\text{step gain} = \dfrac{2\mu\, e(k)}{p_b(t)\, L}$ … (5) Thus, the transfer function is updated.

【００６０】このようにして、第３実施例では、音声信
号のパワーが低減した場合には、適応化を停止させるの
で、高い推定精度を維持することが可能である。In this way, in the third embodiment, when the power of the audio signal is reduced, the adaptation is stopped, so that high estimation accuracy can be maintained.
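The decision flow of steps ST32 to ST37 and equation (5) can be sketched in Python as follows. This is only an illustration under assumptions: L is taken to be the number of filter taps, and, since equation (4) is not reproduced in this passage, the weight update is assumed to have the usual LMS form W ← W + gain · X. The function and parameter names are hypothetical.

```python
def step_gain(p_a, p_b, e_k, mu, filter_len, v_a, v_b):
    """Step gain for one sample, following steps ST32 to ST37."""
    if p_a <= v_a:      # ST32 YES -> ST36: mu = 0, no update
        return 0.0
    if p_b <= v_b:      # ST33 YES -> ST37: mu = 0, no update
        return 0.0
    # ST33 NO: equation (5), with L assumed to be the filter length
    return 2.0 * mu * e_k / (p_b * filter_len)

def update_weights(weights, x_recent, gain):
    """Assumed LMS-style update: each tap moves by gain times its input sample."""
    return [w + gain * x for w, x in zip(weights, x_recent)]
```

With a gain of 0 the weights are returned unchanged, which is the "stop" behaviour described above; dividing by p_b normalises the update by the smoothed input power.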

【００６１】なお、この実施例では平滑化フィルタを２
個設ける構成としたが、特にこれに限定されるものでは
なく、１、又は３以上の平滑化フィルタを用いても構成
可能であることは自明である。Although two smoothing filters are provided in this embodiment, the configuration is not limited to this; it is obvious that a configuration using one, or three or more, smoothing filters is also possible.

【００６２】また、伝達関数の推定を行なう際には、適
応フィルタの入力信号である合成音声と希望出力である
マイクロホン信号とが常に一定の時間差をもって得られ
ることが必要である。すなわちマイクロホン信号中の合
成音成分は、スピーカから出力された合成音とは音響伝
達系の伝播遅延分だけ時間差があり、伝達関数推定の際
はこれが保存されている必要がある。入力信号の合成音
声を計算機内部から直接得る場合には、計算機の負荷の
具合や思わぬ誤動作により、計算機内部に持っている合
成音声が期待したタイミングでスピーカから出力されな
い場合が考えられる。このような場合にも安定に伝達関
数推定を行なうため、図９に示すように２ｃｈのＡ／Ｄ
変換器３１，３２によってマイクロホン信号と合成音声
信号とを得ることにより、一定のタイミングで２つの信
号を得ることが可能である。Further, when estimating the transfer function, the synthesized speech, which is the input signal of the adaptive filter, and the microphone signal, which is the desired output, must always be obtained with a constant time difference. That is, the synthesized-sound component in the microphone signal lags the synthesized sound output from the speaker by the propagation delay of the acoustic transfer system, and this time difference must be preserved during transfer function estimation. If the synthesized speech serving as the input signal is taken directly from inside the computer, the synthesized speech held inside the computer may not be output from the speaker at the expected timing because of the computer's load or an unexpected malfunction. To estimate the transfer function stably even in such cases, the microphone signal and the synthesized speech signal are acquired by the two-channel A/D converters 31 and 32 as shown in FIG. 9, so that the two signals are obtained with fixed timing.

【００６３】また、伝達関数推定は計算量が多いため、
実時間で計算を終えるためにＤＳＰボードを用いて音声
応答除去部を構成できる。Since the transfer function estimation requires a large amount of calculation,
The voice response removal unit can be configured using a DSP board to finish the calculation in real time.

【００６４】図１２は合成音声除去装置付きの音声対話
システムの外観である。利用者はマイクロホン２３に向
かって音声を入力し、システムの合成音声応答がスピー
カ２１から出力される。上記ＡＤ変換装置は音声信号の
帯域を考慮して１２［ｋHz］のサンプリング周波数を使
用している。利用者はモニタ２２の補助情報を見ながら
対話を進めていくが、合成音除去装置によって合成音声
が打ち消されており、音声認識装置には利用者の音声だ
けが入力されるので、利用者はシステムが応答中でも割
り込んで音声を入力することができる。このとき、マイ
クロホンはスピーカからの合成音声をなるべく拾わない
ように指向性のものを用いても良いが、周囲の壁からの
反射音は残ってしまうため、指向性マイクホンの使用の
みでは合成音声を消すことはできない。又、入力音声の
ＳＮ比を良くするためになるべくマイクロホンの近く、
例えばマイクロホンから３０cm以内程度の距離で発声す
るのが望ましいが、ユーザの体に反射した合成音がマイ
クロホンに入ってしまうことになる。この大きさはユー
ザとマイクロホンが近いために反射音の中で最もレベル
が大きく、且つ体の動きによって振幅と時間遅れが変化
する。以上のような場合でも適応フィルタによって伝達
関数を更新しているので周囲の壁による反射やユーザの
動き、あるいは他の人々の動きによる伝達関数の変化に
追随することができ効果的に合成音を除去することがで
きる。FIG. 12 is an external view of a voice dialogue system equipped with a synthesized-speech removing device. The user speaks into the microphone 23, and the system's synthesized voice response is output from the speaker 21. The A/D converter uses a sampling frequency of 12 [kHz] in consideration of the band of the speech signal. The user proceeds with the dialogue while looking at the auxiliary information on the monitor 22; because the synthesized speech is cancelled by the synthesized-sound removing device and only the user's voice is input to the speech recognition device, the user can interrupt and input speech even while the system is responding. A directional microphone may be used so as to pick up as little of the synthesized speech from the speaker as possible, but sound reflected from the surrounding walls remains, so the synthesized speech cannot be eliminated by a directional microphone alone. Also, to improve the S/N ratio of the input speech, it is desirable to speak as close to the microphone as possible, for example within about 30 cm, but then synthesized sound reflected from the user's body enters the microphone. Because the user is close to the microphone, this reflection has the highest level among the reflected sounds, and its amplitude and time delay change with the movement of the body. Even in such cases, since the transfer function is updated by the adaptive filter, the system can follow changes in the transfer function caused by reflections from surrounding walls, movements of the user, or movements of other people, and can remove the synthesized sound effectively.

【００６５】次に、本発明の第４実施例について説明す
る。これは、システムが誤って合成音を検出してしまう
ことを防止する例である。Next, a fourth embodiment of the present invention will be described. This is an example of preventing the system from erroneously detecting a synthetic sound.

【００６６】図１６は該第４実施例の構成を示すブロッ
ク図である。図示のように、この音声対話システムは減
算器４の出力側に音声検出部３１が設けられている。FIG. 16 is a block diagram showing the structure of the fourth embodiment. As shown in the figure, the voice interaction system is provided with a voice detection unit 31 on the output side of the subtractor 4.

【００６７】音声検出部３１は、減算器４の出力信号と
背景雑音及び除去されるべき合成音が誤って残ってしま
った信号を基に音声入力があったか否かを判定するもの
であり、図１７に示すように、検出しきい値決定部３２
と、音声判定部３３と、インパルス応答推定部３４から
構成されている。The voice detection unit 31 determines whether or not there has been voice input, based on the output signal of the subtractor 4 and on a signal in which background noise and synthesized sound that should have been removed erroneously remain. As shown in FIG. 17, it comprises a detection threshold determining unit 32, a voice determination unit 33, and an impulse response estimation unit 34.

【００６８】インパルス応答推定部３４は、スピーカ８
とマイクロホン１間のインパルス応答を推定し、これを
検出しきい値決定部３２に供給する。The impulse response estimation unit 34 estimates the impulse response between the speaker 8 and the microphone 1 and supplies it to the detection threshold determining unit 32.

【００６９】検出しきい値決定部３２は、前記インパル
ス応答とスピーカ８から出力される合成音声を基に、減
算器４の出力が音声入力であるか否かを判定するための
しきい値を決定する。Based on the impulse response and the synthesized speech output from the speaker 8, the detection threshold determining unit 32 determines a threshold for judging whether or not the output of the subtractor 4 is voice input.

【００７０】音声判定部３３は、後述するようにしきい
値を越えた信号の継続時間等に基づいて入力信号が音声
入力であるか否かを判定するものである。The voice determination unit 33 determines whether or not the input signal is a voice input, based on the duration of the signal exceeding the threshold value and the like, as will be described later.

【００７１】以下、図１８を用いて具体的に説明する。
同図は音声検出に使う検出パラメータの例を表したもの
で、音声の始端をＡ、終端をＢで表してある。予め背景
雑音パワーＰｏを測定し、これに始端決定用のマージン
Ｍｓ、例えば５ｄＢを加えた値を始端検出しきい値Ｐ
ｓ、終端決定用マージンＭｅ、例えば３ｄＢを加えた値
を終端検出しきい値Ｐｅと定める。また、始端決定用の
音声持続時間Ｔｓを例えば２０ｍｓ、終端決定用の無音
持続時間Ｔｅを例えば２００ｍｓ、最小音声持続時間Ｔ
ｖを例えば２００ｍｓと定める。A specific description is given below with reference to FIG. 18. The figure shows an example of the detection parameters used for voice detection, with the start of the speech denoted A and the end denoted B. The background noise power Po is measured in advance; the value obtained by adding a start-determination margin Ms, for example 5 dB, to Po is defined as the start detection threshold Ps, and the value obtained by adding an end-determination margin Me, for example 3 dB, is defined as the end detection threshold Pe. In addition, the speech duration Ts for determining the start is set to, for example, 20 ms, the silence duration Te for determining the end to, for example, 200 ms, and the minimum speech duration Tv to, for example, 200 ms.

【００７２】そして、入力信号パワーの計算をある時間
間隔、例えば１０ｍｓ毎に行い、新しい値が得られる度
に検出しきい値との比較を行いながら、例えば図１９の
状態遷移図に従って検出状態の遷移を行い、音声検出を
行うことができる。時間はパワー計算時間間隔の倍数で
表すことにし、図１９で始端Ａから測った時間をｎｓ、
終端から測った時間をｎｅとしてある。また、時刻を
ｉ、時刻ｉにおけるパワーをＰｉで表してある。また、
矢印は状態の遷移先を示し、矢印の傍らの式は遷移条件
を表している。状態数は６個であり、音声が入力されて
いない状態を表す無音状態（Ｓ０）、仮の始端が定まっ
た状態を表す始端仮定状態（Ｓ１）、始端が確定した状
態を表す始端確定状態（Ｓ２）、音声であることが確定
していることを表す音声確定状態（Ｓ３）、仮の終端が
定まった状態を表す終端仮定状態（Ｓ４）、音声がまだ
継続していることを表す音声継続状態（Ｓ５）、終端が
確定し、音声検出が終了した状態を表す終端確定状態
（Ｓ６）がある。The input signal power is calculated at fixed time intervals, for example every 10 ms, and each time a new value is obtained it is compared with the detection thresholds while the detection state is moved according to, for example, the state transition diagram of FIG. 19; voice detection can be performed in this way. Time is expressed as a multiple of the power calculation interval; in FIG. 19, the time measured from the start A is denoted ns and the time measured from the end is denoted ne. The time is denoted i, and the power at time i is denoted Pi. Arrows indicate the destination of a state transition, and the expression beside each arrow is the transition condition. The number of states is given as six, and the states are: a silent state (S0), in which no speech is being input; a start-assumed state (S1), in which a provisional start has been set; a start-determined state (S2), in which the start has been fixed; a speech-determined state (S3), in which the input has been confirmed to be speech; an end-assumed state (S4), in which a provisional end has been set; a speech-continuing state (S5), in which the speech is still continuing; and an end-determined state (S6), in which the end has been fixed and voice detection is finished.

【００７３】まず、音声入力がない場合は無音（Ｓ０）
の状態にあり、ある時刻ｉ_sでパワーＰｉが始端検出し
きい値Ｐｓを越えると時刻ｉ_sを仮の始端と定め、始端
仮定状態（Ｓ１）へと遷移する。Ｐｓを越えない場合は
無音状態（Ｓ０）のままである。First, when there is no voice input, the detector is in the silent state (S0). When the power Pi exceeds the start detection threshold Ps at some time i_s, that time i_s is taken as the provisional start and the state changes to the start-assumed state (S1). If Ps is not exceeded, the state remains silent (S0).

【００７４】始端仮定状態（Ｓ１）になった時刻からｎ
ｓを測りはじめ、パワーが始端検出しきい値Ｐｓを越え
たままｎｓが始端決定用の音声持続時間Ｔｓ以上になっ
た場合には時刻ｉ_sを始端であると定めて始端確定状態
（Ｓ２）へと遷移する。時間Ｔｓが経過するまでは始端
仮定状態（Ｓ１）でいる。時間がＴｓに達する前にパワ
ーが始端検出しきい値Ｐｓを下回った場合には無音状態
（Ｓ０）へと遷移する。次いで、始端確定状態（Ｓ２）
においてパワーがＰｓ以上のまま時間ｎｓが最小音声持
続時間Ｔｖ以上になった場合には時刻ｉ_sから現在まで
の入力信号が音声であるとみなし、音声確定状態（Ｓ
３）へと遷移する。Ｔｖに達する前にパワーがＰｓを下
回った場合には無音状態（Ｓ０）へと遷移する。Measurement of ns starts from the time the start-assumed state (S1) is entered. If ns reaches or exceeds the start-determination speech duration Ts while the power stays above the start detection threshold Ps, time i_s is fixed as the start and the state changes to the start-determined state (S2). Until the time Ts has elapsed, the state remains start-assumed (S1). If the power falls below the start detection threshold Ps before the time reaches Ts, the state changes to the silent state (S0). Next, in the start-determined state (S2), if the time ns reaches or exceeds the minimum speech duration Tv while the power stays at or above Ps, the input signal from time i_s to the present is regarded as speech and the state changes to the speech-determined state (S3). If the power falls below Ps before Tv is reached, the state changes to the silent state (S0).

【００７５】そして、音声確定状態（Ｓ３）においてパ
ワーがＰｅを下回った場合にはこのときの時刻ｉ_eが終
端であると仮定し、終端仮定状態（Ｓ４）へと遷移す
る。時刻ｉ_eから終端決定用の時間長パラメータｎｅを
測り始める。パワーがＰｅ以上の場合には音声確定状態
（Ｓ３）のままである。その後、終端仮定状態（Ｓ４）
においてパワーがＰｅを下回ったままｎｅが終端決定用
の無音持続時間Ｔｅ以上となった場合には終端が決定し
たものとし、終端決定状態（Ｓ６）へ遷移して検出処理
を終了する。Ｔｅに達する前にパワーＰがＰｅ以上とな
った場合には音声継続状態（Ｓ５）へと遷移する。次い
で、音声継続状態（Ｓ５）おいてパワーＰｉがＰｅを下
回った場合にはこのときの時刻ｉ_e′が終端であると仮
定し、終端仮定状態（Ｓ４）へと遷移する。パワーがＰ
ｅ以上の場合には音声継続状態（Ｓ５）のままである。
こうして、音声入力が認識されるのである。In the speech-determined state (S3), if the power falls below Pe, the time i_e at that moment is assumed to be the end and the state changes to the end-assumed state (S4). Measurement of the end-determination time-length parameter ne starts from time i_e. While the power is at or above Pe, the state remains speech-determined (S3). Thereafter, in the end-assumed state (S4), if ne reaches or exceeds the end-determination silence duration Te while the power stays below Pe, the end is regarded as fixed, the state changes to the end-determined state (S6), and the detection process ends. If the power P becomes equal to or greater than Pe before Te is reached, the state changes to the speech-continuing state (S5). Next, in the speech-continuing state (S5), if the power Pi falls below Pe, the time i_e′ at that moment is assumed to be the end and the state changes to the end-assumed state (S4). While the power is at or above Pe, the state remains speech-continuing (S5). In this way, the voice input is recognized.
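The state transitions S0 to S6 described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of FIG. 19: it assumes one power value in dB per 10 ms frame, so that Ts = 20 ms, Te = 200 ms and Tv = 200 ms become 2, 20 and 20 frames, and the function and parameter names are hypothetical.

```python
S0, S1, S2, S3, S4, S5, S6 = range(7)  # state labels from the description

def detect_speech(powers_db, po_db, ms_db=5.0, me_db=3.0, ts=2, te=20, tv=20):
    """Return (start_frame, end_frame) of the detected speech, or None."""
    ps = po_db + ms_db          # start detection threshold Ps
    pe = po_db + me_db          # end detection threshold Pe
    state, i_s, i_e = S0, None, None
    for i, p in enumerate(powers_db):
        if state == S0:
            if p > ps:                      # provisional start
                state, i_s = S1, i
        elif state == S1:
            if p < ps:
                state = S0
            elif i - i_s >= ts:             # ns >= Ts: start fixed
                state = S2
        elif state == S2:
            if p < ps:
                state = S0
            elif i - i_s >= tv:             # ns >= Tv: input is speech
                state = S3
        elif state == S3:
            if p < pe:                      # provisional end
                state, i_e = S4, i
        elif state == S4:
            if p >= pe:
                state = S5                  # speech resumed
            elif i - i_e >= te:             # ne >= Te: end fixed (S6)
                return i_s, i_e
        elif state == S5:
            if p < pe:
                state, i_e = S4, i
    return None
```

A short burst that ends before Tv returns to S0 and is ignored, while a brief pause inside an utterance passes through S4 and S5 and only moves the provisional end later.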

【００７６】次に音声応答があるとき、即ち、スピーカ
８からの音声応答が完全に除去されないときの音声検出
の方法について説明する。音声応答が出力されている場
合には合成音の分だけ入力信号レベルが上がるので、検
出しきい値をその分上げておくことによって誤った音声
検出をなくすことができる。高いレベルの合成音が入力
されても検出されないように、安全のためにしきい値の
上げ幅を大きな一定値で不変の値とすると、音声応答が
ない場合の検出性能を低下させることになる。したがっ
て、常に検出性能を高く保つには、応答音声のパワーに
応じて最低限の上げ幅でしきい値を毎時設定することが
望ましい。以下に図２０のタイムチャートを使って音声
応答のパワーに応じたしきい値設定方法を説明する。Next, the voice detection method used when a voice response is present, that is, when the voice response from the speaker 8 is not completely removed, is described. While a voice response is being output, the input signal level rises by the amount of the synthesized sound, so erroneous voice detection can be avoided by raising the detection threshold by that amount. If, to be on the safe side, the threshold were raised by a large fixed amount so that even a high-level synthesized sound would not be detected, detection performance when there is no voice response would be degraded. Therefore, to keep detection performance high at all times, it is desirable to set the threshold at each time step with the minimum necessary increase, according to the power of the response speech. A threshold setting method that follows the power of the voice response is described below with reference to the time chart of FIG. 20.

【００７７】まず、音声入力がない状態で、背景雑音パ
ワーＰｏの測定（ステップＳＴ４１）、及び、一定時
間、例えば３秒間合成音を出力してスピーカ−マイクロ
ホン間のインパルス応答推定を行う（ステップＳＴ４
２）。インパルス応答推定は応答音声除去部２で行って
いるのでその結果を使うことができ、新たに推定部を設
ける必要はない（ステップＳＴ４３）。次に推定したイ
ンパルス応答に音声応答信号を畳み込んでマイクロホン
信号中の合成音成分とそのパワーＰｓを求める（ステッ
プＳＴ４４）。合成音パワーＰｓと背景雑音パワーＰｏ
との和Ｐを音声検出のベースレベルＰｂとおくことによ
って合成音パワーに応じたしきい値設定を行うことがで
きる（ステップＳＴ４５）。時間ｉ＝０以後、パワー計
算は一定時間間隔、例えば１０ｍｓ毎に行うことにより
計算量を減らすことができ、その際応答音声除去部２で
推定された新しいインパルス応答を使うことによって音
響系の変化にも対応できる。合成音は音声応答除去部２
によって消去されているので、音声応答パワーの推定値
Ｐｓはもっと小さい値にすることも可能であるが、音響
系が変化している場合はインパルス応答の推定が音響系
の変化に追随できずに消去率が小さくなることもあるの
でＰｓをそのまま使うのが安全である。First, with no voice input, the background noise power Po is measured (step ST41), and a synthesized sound is output for a fixed time, for example 3 seconds, to estimate the impulse response between the speaker and the microphone (step ST42). Since impulse response estimation is already performed in the response voice removing unit 2, its result can be used and there is no need to provide a separate estimating unit (step ST43). Next, the voice response signal is convolved with the estimated impulse response to obtain the synthesized-sound component in the microphone signal and its power Ps (step ST44). By taking the sum P of the synthesized-sound power Ps and the background noise power Po as the base level Pb for voice detection, the threshold can be set according to the synthesized-sound power (step ST45). From time i = 0 onward, the amount of computation can be reduced by performing the power calculation at fixed intervals, for example every 10 ms, and changes in the acoustic system can be accommodated by using the latest impulse response estimated by the response voice removing unit 2. Because the synthesized sound is cancelled by the voice response removing unit 2, the estimate Ps of the voice response power could be set to a smaller value; however, when the acoustic system is changing, the impulse response estimate may fail to follow the change and the cancellation rate may fall, so it is safer to use Ps as it is.
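Steps ST44 and ST45 can be illustrated with a short Python sketch. It assumes that powers are handled on a linear scale (so that Ps and Po can simply be added) and that the power is averaged over fixed-length frames; the function names and the frame length are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out

def mean_power(frame):
    return sum(x * x for x in frame) / len(frame)

def base_levels(response, impulse_response, po, frame_len):
    """Per-frame base level Pb = Ps + Po (linear power, assumed)."""
    # ST44: synthesized-sound component expected in the microphone signal
    echo = convolve(response, impulse_response)[:len(response)]
    levels = []
    for start in range(0, len(echo) - frame_len + 1, frame_len):
        ps = mean_power(echo[start:start + frame_len])
        levels.append(ps + po)          # ST45: base level for detection
    return levels
```

The detection margins Ms and Me would then be applied on top of each frame's base level instead of on top of Po alone, so the threshold rises only while the response is actually sounding.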

【００７８】次にインパルス応答推定を高精度に行う例
について説明する。適応フィルタの入力である音声信号
は周波数スペクトルが平坦でないため、ＬＭＳアルゴリ
ズムによる適応フィルタの収束速度が遅くなることが知
られている。そこで、広帯域雑音を合成音声に付加する
ことによって全周波数のＳ／Ｎを上げ、伝達関数の高精
度な推定を行うことができる。その際、応答音声信号パ
ワーに応じて雑音パワーを変化させることにより雑音が
ユーザーにとって耳障りとならないようにすることがで
きる。特に無音部では雑音が気になりやすいので雑音振
幅を０とおくとよい。Next, an example of performing the impulse response estimation with high accuracy is described. The speech signal that is the input of the adaptive filter does not have a flat frequency spectrum, and it is known that this slows the convergence of an adaptive filter driven by the LMS algorithm. By adding wideband noise to the synthesized speech, the S/N can be raised at all frequencies and the transfer function can be estimated with high accuracy. In doing so, the noise power can be varied according to the power of the response speech signal so that the noise is not unpleasant to the user. Noise is especially noticeable in silent portions, so the noise amplitude is preferably set to 0 there.
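The noise addition just described can be sketched in Python as follows. This is an illustration under assumptions: the noise is taken to be Gaussian white noise, its level is set relative to each frame's power (the default of −20 dB is borrowed from the experiment reported for FIG. 22), and the silence criterion, frame handling, and names are hypothetical.

```python
import math
import random

def add_probe_noise(frames, rel_level_db=-20.0, silence_power=1e-6, seed=0):
    """Add white noise scaled to each frame's power; none in silent frames."""
    rng = random.Random(seed)
    out = []
    for frame in frames:
        power = sum(x * x for x in frame) / len(frame)
        if power <= silence_power:
            out.append(list(frame))     # silent portion: noise amplitude 0
            continue
        # Noise standard deviation for the requested relative level
        sigma = math.sqrt(power * 10.0 ** (rel_level_db / 10.0))
        out.append([x + rng.gauss(0.0, sigma) for x in frame])
    return out
```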

【００７９】また、付加する雑音はシステムを使用する
場所における環境雑音、例えば駅の人込みの雑音や計算
機室の雑音を録音したものか、または似たような雑音と
すれば一定の振幅で連続して出力しても耳障りでないよ
うにできる。If the added noise is a recording of the environmental noise at the place where the system is used, for example the noise of a crowd at a station or the noise of a computer room, or a noise similar to it, it will not be unpleasant even when output continuously at a constant amplitude.

【００８０】また、上述の音声信号による適応フィルタ
駆動時の収束速度の低下は、入力信号のスペクトル平坦
化によっても改善されることが知られている。平坦化の
ためには通常逆フィルタが使われるが、入力の差分信号
をとることによっても低周波成分に偏ったパワーを補正
することができる。差分処理は非常に簡単な処理である
ため計算量も少なく、リアルタイムシステムには都合が
良い。図２１は合成音の「いらっしゃい」の「い」の音
の周波数スペクトルで、曲線ａは差分処理後、曲線ｂは
もとのスペクトルを表している。差分処理によって中高
域成分のパワーが低域と同等となり、平坦化しているこ
とが理解される。It is also known that the slow convergence that occurs when the adaptive filter is driven by a speech signal, as described above, can be improved by flattening the spectrum of the input signal. An inverse filter is usually used for flattening, but taking the difference signal of the input can also correct power that is biased toward low-frequency components. Difference processing is very simple and therefore computationally cheap, which is convenient for a real-time system. FIG. 21 shows the frequency spectrum of the sound "i" in the synthesized utterance "irasshai" ("welcome"); curve a is the spectrum after difference processing and curve b is the original spectrum. It can be seen that difference processing brings the power of the mid- and high-frequency components up to a level comparable to that of the low-frequency components, flattening the spectrum.
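As a minimal sketch of the difference processing, the following Python function computes a first-order difference, assuming d[n] = x[n] − x[n−1] with the sample before the first taken as 0. The description does not give the exact form or say to which signals it is applied, so both points are assumptions, and the function name is hypothetical.

```python
def first_difference(signal):
    """First-order difference: attenuates low frequencies, keeps high ones."""
    prev = 0.0
    out = []
    for x in signal:
        out.append(x - prev)
        prev = x
    return out
```

A constant (zero-frequency) input is suppressed after the first sample, while a sample-to-sample alternation is passed with doubled amplitude, which is the low-frequency de-emphasis the paragraph relies on.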

【００８１】また、図２２は「以上でよろしいですか」
という合成音声を入力としたときの伝達関数推定結果で
ある。曲線ｃは音声応答パワーに対して２０ｄＢ低いレ
ベルの白色雑音を付加した場合、曲線ｂは差分処理を使
った場合、ｄはどちらの処理も行わない場合の推定結果
であるが、雑音付加、差分処理各々により推定精度が向
上することが理解できる。更に、曲線ａは雑音付加と差
分処理を併用した場合の実験結果であるが、両処理の併
用により更に推定精度が向上することが理解できる。FIG. 22 shows the transfer function estimation results obtained when the synthesized speech "ijou de yoroshii desu ka" ("Will that be all?") is input. Curve c is the result when white noise at a level 20 dB below the voice response power is added, curve b is the result when difference processing is used, and curve d is the result when neither is applied; it can be seen that noise addition and difference processing each improve the estimation accuracy. Curve a is the experimental result when noise addition and difference processing are used together, and it can be seen that combining the two improves the estimation accuracy further.

【００８２】次に合成音キャンセラを使って音声応答を
キャンセルする際の合成音の音量、スピーカとマイクロ
ホンの位置と向きの設定方法に関する例を以下に説明す
る。Next, an example of a method of setting the volume of the synthesized sound and the position and orientation of the speaker and the microphone when canceling the voice response using the synthesized sound canceller will be described below.

【００８３】図２３は合成音のパワーとキャンセル性能
の関係を示している。図でａは消去されたパワーを、ｂ
は残留パワーを表している。合成音を大きくするほど消
去パワーは大きくなるが残留パワーも大きくなるので、
音声認識に対しては合成音を小さく設定する方が効果的
であることが理解される。また、音声入力用のマイクロ
ホンや出力用のスピーカは指向性を持ち、設定によって
マイクロホンに入力される音声応答のパワーが異なるた
め、キャンセルの効果にも差が出てくる。図２４はマイ
クロホンの向きとキャンセル性能の関係を表した図で、
図２５に示すような設定でマイクロホンとスピーカのな
す角度φを変化させた結果である。図でｂは消去された
パワーを、ｃは残留パワーを表している。マイクロホン
は広く使用されている単一指向性のもので、感度最小と
なる死角はマイクロホンの握り柄の方向である。マイク
ロホンの頭をスピーカに向けた場合が最も消去パワーが
大きいが、残留パワーも大きくなる。逆に死角をスピー
カに向けた場合が残留パワーが最も小さいため、音声認
識に対して効果的であることが理解される。FIG. 23 shows the relationship between the power of the synthesized sound and the cancellation performance. In the figure, a represents the cancelled power and b the residual power. The louder the synthesized sound, the larger the cancelled power, but the residual power also grows, so it can be seen that setting the synthesized sound low is more effective for speech recognition. In addition, the microphone for voice input and the speaker for output are directional, and the power of the voice response entering the microphone differs depending on how they are set up, so the cancellation effect also differs. FIG. 24 shows the relationship between the orientation of the microphone and the cancellation performance, obtained by varying the angle φ between the microphone and the speaker in the setup shown in FIG. 25. In the figure, b represents the cancelled power and c the residual power. The microphone is a widely used unidirectional type, whose null (the direction of minimum sensitivity) lies in the direction of the microphone's handle. The cancelled power is largest when the head of the microphone is pointed at the speaker, but the residual power is also large. Conversely, the residual power is smallest when the null is pointed at the speaker, which is therefore effective for speech recognition.

【００８４】また、図２６はマイクロホンとスピーカと
の間の距離とキャンセル性能の関係を表している。図で
ａは消去されたパワーを、ｂは残留パワーを表してい
る。距離を大きくするほど残留パワーも小さくなること
が理解される。FIG. 26 shows the relationship between the distance between the microphone and the speaker and the cancel performance. In the figure, a represents erased power and b represents residual power. It is understood that the larger the distance, the smaller the residual power.

【００８５】以上を総合するとマイクロホンに入力され
る合成音をなるべく小さくすることが音声認識に対して
効果的な音響系の設定であることが理解される。したが
って、(1) 出力合成音は対話に差支えない範囲内で可能
な限り小さい音量とする、(2) マイクロホンの死角に入
るようにスピーカを置く、(3) スピーカとマイクロホン
はなるべく距離を離す、ことが効果的な音響系設定であ
る。Taken together, the above shows that an acoustic setup effective for speech recognition is one that makes the synthesized sound entering the microphone as small as possible. Accordingly, an effective acoustic setup is to (1) keep the output synthesized sound as quiet as possible within a range that does not hinder the dialogue, (2) place the speaker in the null of the microphone, and (3) separate the speaker and the microphone as far as possible.

【００８６】次に、本発明の第５実施例について説明す
る。該第５実施例は、システムからの応答出力中に利用
者が割り込んで入力を行うことへの対処を考慮した音声
対話システムであり、図２７に示すように入力認識理解
部４１と、対話管理部４２と、応答生成出力部４３と、
割込制御部４４から構成されている。そして、例えば図
２８（ａ）に示す如くの応答中に利用者からの割込み入
力を受けることのできない対話から同図（ｂ），
（ｃ），（ｄ）に示すように、割込み入力の意味を理解
するに必要なキーワードを認識し、あるいは、入力音声
の電力が最小音声持続時間Ｔ_V以上続けて始端検出しき
い値Ｐ_Sを越えた場合、割込み入力があったものとして
検出する。この検出に要する時間をＴ_detとする。そし
て、割込みを受けたら応答を中断する場合（ｂ）、割込
みを受けたら応答をフェードアウトさせる場合（ｃ）、
そして、割込みを受けたら応答の区切りの良いところま
で出力する場合（ｄ）など柔軟な対話を可能とさせる。Next, a fifth embodiment of the present invention will be described. The fifth embodiment is a voice dialogue system designed to cope with the user interrupting with input while the system is outputting a response and, as shown in FIG. 27, comprises an input recognition/understanding unit 41, a dialogue management unit 42, a response generation/output unit 43, and an interrupt control unit 44. In contrast to a dialogue in which interrupting input from the user cannot be accepted during a response, as shown in FIG. 28(a), an interrupt is detected, as shown in FIGS. 28(b), (c), and (d), when a keyword needed to understand the meaning of the interrupting input is recognized, or when the power of the input speech stays above the start detection threshold P_S for at least the minimum speech duration T_V. The time required for this detection is denoted T_det. Flexible dialogue is thereby made possible, such as interrupting the response when an interrupt is received (b), fading the response out when an interrupt is received (c), or continuing the output up to a convenient break in the response when an interrupt is received (d).

【００８７】パターン認識理解部４１は、利用者からの
入力を検出、認識してその内容を理解するためのもの
で、入力メディアとして音声、キーボード、マウスやタ
ッチパネルなどのポインティングデバイスを利用してい
る。音声入力では、例えばＨＭＭやキーワードスポッテ
ィングなどの方法により発話内容を認識、意味を理解す
る。キーボード入力では文字列解析を行い、ポインティ
ングデバイスでは例えばポイント位置や移動方向、移動
速度情報からその意味を理解する。The pattern recognition/understanding unit 41 detects and recognizes input from the user and understands its content; the input media it uses are speech, a keyboard, and pointing devices such as a mouse or a touch panel. For speech input, the utterance is recognized and its meaning understood by a method such as HMM or keyword spotting. For keyboard input, character string analysis is performed, and for a pointing device the meaning is understood from, for example, the pointed position, the direction of movement, and the speed of movement.

【００８８】対話管理部４２は、パターン認識理解部４
１から得た入力の理解結果から、次に出力すべき応答の
内容を決める。例えば、入力の理解結果とその履歴や入
力の直前のシステムの応答内容から計算機の内部状態が
決まるように対話の流れを状態遷移で表現し、予め決め
ておいた各状態での出力すべき応答内容のテーブルを参
照して、応答内容を決定する。応答内容の例を表１〜表
５に示す。The dialogue management unit 42 decides the content of the response to be output next from the result of understanding the input obtained from the pattern recognition/understanding unit 41. For example, the flow of the dialogue is expressed as state transitions so that the internal state of the computer is determined from the understanding result of the input, its history, and the content of the system's response immediately before the input; the response content is then decided by referring to a predetermined table of the response content to be output in each state. Examples of response content are shown in Tables 1 to 5.

【００８９】[0089]

【表１】 [Table 1]

【表２】 [Table 2]

【表３】 [Table 3]

【表４】 [Table 4]

【表５】まず、表１〜表３は「きのう来たメールのリストの表示
ですね。」という応答内容である。表１の例は、応答内
容の中に特に強調すべきポイントのない普通の場合であ
る。表２は、「きのう」であるかどうかを確認するとき
の応答内容の例であり、「きのう」の部分の重要性を高
くしている。表３は、「表示」するかどうかを確認する
ときの応答内容の例であり、「表示ですね」の部分の重
要性を高くしている。表４，５は「ホストpanda から応
答がありません。」という警告のための応答内容であ
り、応答内容の一部の重要性が高い例と応答全体の重要
性が高い例を示している。[Table 5] First, Tables 1 to 3 give response content for the response "You want to display the list of mail that arrived yesterday, correct?". Table 1 is the ordinary case, in which no part of the response content needs particular emphasis. Table 2 is an example of the response content used to confirm whether "yesterday" is correct, and gives the "yesterday" part a high importance. Table 3 is an example of the response content used to confirm whether to "display", and gives the "display, correct?" part a high importance. Tables 4 and 5 give response content for the warning "There is no response from host panda."; they show an example in which part of the response content has high importance and an example in which the whole response has high importance.

【００９０】応答生成出力部４３は、対話管理部４２で
決められた応答内容にしたがい、音声を含む応答メディ
ア、例えば応答内容にしたがった音韻処理、音響パラメ
ータの生成、音声波形の生成の順に処理することによる
合成音声などの聴覚的なメディアを用いた応答の生成、
音声応答と同じ応答文あるいはその要約した内容、ある
いはそのポイントとなる言葉のテキストや応答内容にし
たがい、システムの内部状態などを提示するグラフィク
スなどの視覚的なメディアなどを用いた応答を生成出力
する。対話管理部４２から応答内容が渡されると、応答
出力とその出力タイミングを示す応答出力位置情報を決
定し、それにしたがい応答出力を開始する。応答出力位
置情報の例を表６，図２９に示す。In accordance with the response content decided by the dialogue management unit 42, the response generation/output unit 43 generates and outputs a response using response media that include speech: for example, a response using auditory media such as synthesized speech, produced by performing phonological processing, acoustic parameter generation, and speech waveform generation in that order according to the response content, and a response using visual media such as the same response sentence as the voice response, a summary of it, the text of its key words, or graphics that present the internal state of the system according to the response content. When response content is passed from the dialogue management unit 42, the unit determines the response output and the response output position information indicating its output timing, and starts outputting the response accordingly. An example of the response output position information is shown in Table 6 and FIG. 29.

【００９１】[0091]

【表６】 [Table 6]

In this example only the voice response is listed, but this is not a limitation and depends on the application: response output position information indicating the output timing can be determined in the same way for other auditory media or for visual media.

【００９２】 For a voice response, the response output position information takes as its synthesis unit a sentence, clause, phrase, bunsetsu (phrase segment), word or syllable of the response to be output, or a sequence of several of these that forms a semantic unit, and lists each synthesis unit together with data indicating its output time. Such a list of output times per synthesis unit can easily be created from the speech rate, the durations of the synthesis segments, and the response output start time. With this response output position information, as shown in FIG. 29, when the user interrupts partway through the response output, the time of the interruption can be associated with the position in the response output. The interrupt control unit 44 then outputs interrupt control information, by which the response output can, for example, be cut off partway or faded out, or the response generation parameters can be changed.
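The construction of the response output position information and the lookup of the unit being output at the moment of an interruption can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the representation of a synthesis unit as a `(text, mora_count)` pair, and the assumption of a constant speech rate in morae per second are all hypothetical.

```python
from bisect import bisect_right

def build_position_info(units, start_time, mora_rate):
    """Return (text, start, end) per synthesis unit.

    units: list of (text, mora_count) pairs (hypothetical representation).
    mora_rate: speech rate in morae per second, assumed constant here.
    """
    info, t = [], start_time
    for text, moras in units:
        duration = moras / mora_rate
        info.append((text, t, t + duration))
        t += duration
    return info

def unit_at(info, interrupt_time):
    """Index of the unit being output at interrupt_time, or None if outside the response."""
    if not info or interrupt_time < info[0][1] or interrupt_time >= info[-1][2]:
        return None
    starts = [start for _, start, _ in info]
    return bisect_right(starts, interrupt_time) - 1
```

A binary search over the start times is used because the list is ordered by construction; for the short responses discussed here a linear scan would serve equally well.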

【００９３】 The response generation/output unit 43 generates the voice response by a known method, for example the method described in Kawai, "Speech Synthesis System from Japanese Text," doctoral thesis, University of Tokyo (December 1988). As shown in the configuration example of FIG. 30, the values of response generation parameters such as the speech rate, prosody and power of the voice response are determined according to the response content by the speech rate determination unit 45, the prosody determination unit 46 and the power determination unit 47, respectively. The response generation parameter values are determined when the acoustic parameters are generated. The power value can also be changed after the waveform has been generated, as described later. For example, if the importance of the response content is high, the speech rate is slowed, the range of intonation change is widened, and the power is raised. The range of intonation change can easily be controlled by a known method, for example that of Fujisaki and Sudo, "Fundamental frequency pattern of Japanese word accent and a model of its generation mechanism," Journal of the Acoustical Society of Japan, 27, 9, pp. 445-453 (1971).

【００９４】 Further, as shown in the configuration example of FIG. 30, when the response generation/output unit 43 receives response interrupt control information from the interrupt control unit 44, it accordingly either aborts the response, including the speech being output, or changes the response generation parameters, including the speech rate, prosody and power of the voice response being output. When the response is aborted, output continues to the end of the synthesis unit being output and is cut off there. When the synthesis unit is the syllable, for example, the response is output up to the boundary immediately after the syllable, word or bunsetsu being output. As noted above, many kinds of synthesis unit are possible, and the choice of cut-off point is not limited to these. With this way of interrupting, making the synthesis unit a syllable, word, bunsetsu, phrase or the like allows the response output to be cut off naturally. In rule-based synthesis, speech is synthesized in units such as phonemes, words, bunsetsu or phrases, and an abort is arranged so that the response ends with the synthesis unit being output. When recorded speech is played back, the response can simply be cut off once the speech segment being output has finished. When the response generation parameters are changed, the speech rate determination unit 45 changes the speech rate by, for example, ±30%, and the prosody determination unit 46 changes the rate of intonation change for an accent phrase by, for example, ±50%. The power determination unit 47 holds a decay curve that fades the output to 0 after, for example, one second, and either convolves it with the response output waveform or, when the acoustic parameters are generated, convolves it with the time course of the power. Several decay curves can be prepared, for example one for aborting and one for fading out. When the output has become completely 0 as a result of the convolution, the response output is regarded as complete and processing moves on to the next step. These change-rate values are examples that may vary with the application and are not limiting.
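The two ways of ending a response described above, cutting off at the end of the current synthesis unit and fading the power to 0, can be sketched as follows. The function names and the linear ramp are assumptions made for illustration. The text speaks of convolving a decay curve with the waveform; the sketch interprets this as sample-by-sample multiplication by a gain envelope, which is an interpretation and not something the text confirms.

```python
def cutoff_index(unit_ends, current_index):
    """Sample index at which an aborted response stops.

    unit_ends: exclusive end index of each synthesis unit, in output order.
    Returns the end of the unit that contains current_index.
    """
    return next((end for end in unit_ends if end > current_index), unit_ends[-1])

def fade_out(samples, sample_rate, start_index, fade_seconds=1.0):
    """Scale samples from start_index by a linear ramp that reaches 0 after fade_seconds.

    Output is truncated where the gain reaches 0, since the response is then
    regarded as complete. The linear shape is an assumed example curve.
    """
    fade_len = max(1, int(fade_seconds * sample_rate))
    faded = list(samples[:start_index])
    for i, sample in enumerate(samples[start_index:start_index + fade_len]):
        faded.append(sample * (1.0 - i / fade_len))
    return faded
```

Keeping several envelope shapes (one short for aborting, one longer for fading out) would only require passing a different `fade_seconds` or swapping the ramp expression.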

【００９５】 Table 7 shows the interrupt control information. FIG. 31(a) shows the response output when there is no response abort or similar control, and FIG. 31(b) shows the response output when the response is aborted at the fourth output unit. FIG. 32(a) is a flowchart showing the response abort control, and FIG. 32(b) is a flowchart showing in detail the generation and output of the n-th response unit of the response content. This example shows generation of a synthesized voice response that uses CV syllable parameters as the synthesis segments. Depending on the application, CVC syllable parameters may be used as the synthesis segments, or recorded speech may be played back, and the method of generating and outputting the response is not limited to this example.

【００９６】[0096]

【表７】 [Table 7]

In this control flow, the timing at which the response is aborted or faded out, or at which the change of the response generation parameter values begins, is specified by the interrupt control information. For example, when the speech rate is to be changed, it is changed from the timing specified by the interrupt control information, as shown in FIG. 33. In this example the rate rises from the fourth response unit of the response content. The value may be changed in steps at each synthesis unit, or it may be changed smoothly toward the target value from the specified timing. Prosody control is shown in FIGS. 34 and 35: FIG. 34 shows the case where the prosodic variation is normal, and FIG. 35 shows an example in which the variation becomes larger from the fourth response unit of the response content. When recorded speech is played back, several kinds of speech segments with different ranges of prosodic variation are prepared, and on receipt of the interrupt control information the segments matching the required range are selected and played back.
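A per-unit speech rate schedule of the kind shown in FIG. 33 might be computed as in the sketch below. The function name, the zero-based unit index, and the linear ramp over `ramp_units` units are illustrative assumptions. With `ramp_units=1` the rate jumps to the target at the specified unit, and larger values approximate the smooth change toward the target value.

```python
def rate_schedule(n_units, change_from, base_rate, target_rate, ramp_units=1):
    """Speech rate (morae per second) for each synthesis unit.

    Units before index change_from keep base_rate; from change_from on, the
    rate moves linearly to target_rate over ramp_units units (assumed shape).
    """
    rates = []
    for i in range(n_units):
        if i < change_from:
            rates.append(base_rate)
        else:
            frac = min(1.0, (i - change_from + 1) / ramp_units)
            rates.append(base_rate + (target_rate - base_rate) * frac)
    return rates
```

For instance, a change that starts at the fourth unit (index 3) and ramps over two units from 7 to 9 morae per second passes through 8 on the way.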

【００９７】 FIG. 36 shows examples of power control. The power control curve is either convolved with the power parameter values or used as an offset to the power parameter. FIG. 36(a) shows an example in which the power increases from the fourth response unit of the response content, FIG. 36(d) an example in which the power decreases from the fourth response unit, and FIG. 36(c) an example of fading out from the fourth response unit. For a parameter such as power, where an abrupt change over time inherently produces noise, a smooth curve is applied, for example the step response of a critically damped system, a polynomial curve, or a curve based on trigonometric functions.
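One of the smooth curves named above, the step response of a critically damped second-order system, is y(t) = 1 - (1 + ωt)·exp(-ωt) for t ≥ 0. The sketch below uses it to move a power gain from one value to another without a discontinuity. The function names, the parameter `omega` and its default value are assumptions chosen for illustration.

```python
import math

def critically_damped_step(t, omega):
    """Unit step response of a critically damped second-order system."""
    if t <= 0:
        return 0.0
    return 1.0 - (1.0 + omega * t) * math.exp(-omega * t)

def power_gain(t, t_change, start_gain, target_gain, omega=10.0):
    """Gain at time t when the change toward target_gain starts at t_change.

    omega (rad/s) sets how quickly the target is approached; 10.0 is an
    arbitrary example value.
    """
    s = critically_damped_step(t - t_change, omega)
    return start_gain + (target_gain - start_gain) * s
```

Setting `target_gain` to 0 gives a fade-out, a value above `start_gain` gives the power increase case, and a value between 0 and `start_gain` gives the power decrease case.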

【００９８】 The interrupt control unit 44, for its part, outputs the response interrupt control information according to the flow of each of the flowcharts shown in FIGS. 37 to 40.

【００９９】 FIG. 37 shows an example of control in which interruption is not permitted when little of the response remains to be output. During response output (YES in step ST51), it is determined whether the length of the not-yet-output response is equal to or greater than a reference value (step ST52). The reference value is set in units such as the number of synthesis units, morae, words or bunsetsu, for example 8 morae, 3 words, or one synthesis unit. If the length is equal to or greater than the reference value (YES in step ST52), the necessary information is regarded as already output, and control such as aborting the response is performed (step ST53). If the length of the not-yet-output response is below the reference value (NO in step ST52), the remaining response is output as it is (step ST54). The next response content is then determined, and the response is generated and output (step ST55).
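The decision of step ST52 reduces to a single comparison, sketched below. The function name, the return labels and the default of 8 morae (taken from the example value in the text) are illustrative assumptions.

```python
def decide_on_interrupt(remaining_moras, reference_moras=8):
    """FIG. 37-style decision when an interruption arrives during output.

    Returns "abort" when at least reference_moras remain to be output (ST53),
    and "finish_output" when less remains (ST54). Labels are hypothetical.
    """
    if remaining_moras >= reference_moras:
        return "abort"
    return "finish_output"
```

The same comparison works with any of the other units mentioned (synthesis units, words, bunsetsu), provided the remaining length and the reference value are counted in the same unit.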

【０１００】 FIG. 38 shows an example of control in which, if the response content being output is important, the response is output as it is without interruption. During response output (YES in step ST61), the importance of the response content being output is judged (step ST62). If it is not important (NO in step ST62), control such as reducing the power or slowing the speech rate is performed (step ST62). If the response content being output is important (YES in step ST62), the not-yet-output response is output (step ST64). The next response content is then determined, and the response is generated and output (step ST68). As noted above, the importance of the response content can be judged either for the response as a whole or for each synthesis unit that forms part of the response. Specific examples of each case are described later.

【０１０１】 FIG. 39 shows an example of control that compares the importance of the understood content of the interrupting input with the importance of the response content being output. In other words, the content of the input from the user and the content of the response from the loudspeaker are compared, and priority is given to whichever is more important.

【０１０２】 During response output (YES in step ST71), the importance of the understood input content is compared with the importance of the response content being output (step ST72). If the input content is more important (YES in step ST72), the response output is controlled by reducing its power or slowing the speech rate (step ST74). If the response content being output is more important (NO in step ST72), the not-yet-output response is output as it is (step ST73). The next response content is then determined, and the response is generated and output (step ST75).
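The comparison of step ST72 can be sketched as below, under the assumption that importance is expressed as a comparable number. The function name, the field names and the scale factors are hypothetical, and treating a tie as "continue the output" is also an assumption, since the text does not say how ties are resolved.

```python
def resolve_fig39(input_importance, output_importance):
    """Give priority to the more important of the user input and the system response.

    Returns hypothetical control information: either soften the response
    (ST74) or let the remaining response play as is (ST73).
    """
    if input_importance > output_importance:
        # Example adjustments only: lower power and slower speech.
        return {"action": "modify_output", "power_scale": 0.5, "rate_scale": 0.8}
    return {"action": "continue_output"}
```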

【０１０３】 FIG. 40 shows an example of control in which no interruption takes place while important content remains in the not-yet-output response. During response output (YES in step ST81), it is determined whether the not-yet-output response contains important content (step ST82). If it does (YES in step ST82), the not-yet-output part of the response is generated and output (step ST83), and this is repeated until the important content has been output. Once the important content has been output (NO in step ST82), the response output is interrupted, for example by aborting the response (step ST84). The next response content is then determined, and the response is generated and output (step ST85).
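The loop of steps ST82 and ST83 can be sketched as follows. Representing the response as a list of `(text, is_important)` pairs and the function name are assumptions for illustration only.

```python
def output_until_important_done(units, next_index):
    """Keep outputting units while any not-yet-output unit is marked important.

    units: list of (text, is_important) pairs (hypothetical representation).
    next_index: index of the first unit not yet output when the interruption arrives.
    Returns (texts still output after the interruption, index where output stops).
    """
    spoken = []
    i = next_index
    while any(important for _, important in units[i:]):  # ST82
        spoken.append(units[i][0])                        # ST83
        i += 1
    return spoken, i                                      # ST84: interrupt here
```

Because the check looks at everything that remains, an unimportant unit that precedes an important one is still output, which matches the description of repeating until the important content has been output.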

【０１０４】 When the understanding result from the pattern recognition and understanding unit 41 is used, the importance of the content of the user's interrupting utterance is evaluated, as shown by example in Table 8.

【０１０５】[0105]

【表８】 [Table 8]

For example, the importance of the input content is evaluated so that an utterance meaning a correction ranks higher than a back-channel response, and an utterance requesting that the response be interrupted ranks higher than an ordinary interrupting utterance. When the evaluated importance of the understood input is low, as with a back-channel response or another interruption that does not require the response being output to stop, the response being output is output as it is. When the evaluation is normal or important, response interrupt control information is output that interrupts the response being output or changes the response generation parameters. For example, when there is an interruption that requests the response to stop, the response is either interrupted or brought to an end sooner, for instance by raising the speech rate. The understood contents and importance values shown in Table 8 are only examples and may differ depending on the application.
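A table-driven version of this evaluation might look like the sketch below. The contents of Table 8 are not reproduced in the text, so the category names, the numeric levels and the action labels are all hypothetical; only the ordering (back-channel lowest, a stop request above an ordinary interruption) follows the description.

```python
# Hypothetical importance levels; only their relative order follows the text.
INPUT_IMPORTANCE = {
    "backchannel": 0,   # does not require the response to stop
    "ordinary": 1,
    "correction": 2,
    "stop_request": 3,
}

def action_for_input(kind):
    """Map an understood interrupting utterance to a control action (labels assumed)."""
    level = INPUT_IMPORTANCE[kind]
    if level == 0:
        return "continue_output"
    if kind == "stop_request":
        return "abort_or_speed_up"
    return "interrupt_or_change_parameters"
```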

【０１０６】 When the response content being output by the response generation/output unit 43 is used, the priority of the response output is evaluated by referring to the importance of the response content and the interruption timing. This priority is evaluated as shown by example in Tables 9 to 12, by referring to the importance of each synthesis unit of the response or of the response content as a whole, as illustrated in Tables 1 to 5.

【０１０７】[0107]

【表９】 [Table 9]

【表１０】 [Table 10]

【表１１】 [Table 11]

【表１２】 [Table 12]

For example, if an interruption occurs while the response content is a warning to the user or a highly urgent message, that is, when the priority of the response output is high, the interrupting input is not accepted, as in the example of FIG. 38. Alternatively, if an interruption occurs while a warning or an extremely urgent response is being output, that is, when the priority of the response output is extremely high, response interrupt control information is output that slows the speech rate and raises the pitch and power. This conveys to the user that the system's response is extremely important and cannot be interrupted. If an interruption occurs when the priority of the response output is moderately high, response interrupt control information is output that raises the speech rate and raises the pitch and power. Producing such a response for an ordinary warning or a fairly urgent message conveys that, although the response cannot be stopped immediately, the system is trying to deal with the interruption as soon as possible. The response contents and importance values shown in Table 9 are only examples and may differ depending on the application.
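The mapping from response output priority to control information can be sketched as a lookup table. The priority labels, field names and scale factors below are hypothetical, as is the default applied to an ordinary response; only the direction of each change (slower or faster speech, higher pitch and power) follows the description.

```python
# Hypothetical control settings per priority level; the numbers are examples only.
PRIORITY_CONTROL = {
    "extreme":  {"accept_interrupt": False, "rate_scale": 0.8, "pitch_scale": 1.2, "power_scale": 1.2},
    "high":     {"accept_interrupt": False, "rate_scale": 1.0, "pitch_scale": 1.0, "power_scale": 1.0},
    "moderate": {"accept_interrupt": True,  "rate_scale": 1.3, "pitch_scale": 1.2, "power_scale": 1.2},
}

def control_for_priority(priority):
    """Return control information for an interruption during a response of this priority."""
    # Any other priority is treated as an ordinary response that may simply be aborted.
    return PRIORITY_CONTROL.get(priority, {"accept_interrupt": True, "action": "abort"})
```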

【０１０８】 Next, the processing performed by each unit when an interrupting input occurs is described in order. The content of the system's response is determined by the dialogue control unit in the form of the response contents illustrated in Tables 1 to 5. Accordingly, the response generation/output unit first obtains the speech rate, prosody and power in the speech rate determination unit, the prosody determination unit and the power determination unit. For an ordinary response, the speech rate is set to, for example, about 7 morae per second, and the prosody is set by a known method, for example that of Hirose, Fujisaki, Kawai and Yamaguchi, "Synthesis of sentence speech based on a model of the fundamental frequency pattern generation process," Transactions of the IEICE A, vol. J72-A, No. 1, pp. 32-40 (January 1989). From this speech rate, the durations of the synthesis segments and the response output start time, the response output position information illustrated in Table 6 is generated. At the same time the response is generated and its output is started. When an interrupting input arrives from the user, the pattern recognition and understanding unit detects the input, notifies the interrupt control unit, and interprets the meaning of the input. On being notified of the detection, the interrupt control unit checks the interruption timing against the response output position information. If the interruption came after the response output was complete, the interrupt control unit outputs no response interrupt control information, and the dialogue control unit determines the next response content. If the interruption came before the response output was complete, response interrupt control information is output using either or both of the understanding result for that input from the pattern recognition and understanding unit 41 and the response content being output by the response generation/output unit 43. The response interrupt control information is sent to the speech rate determination unit, the prosody determination unit, the power determination unit and the response abort control unit, which perform control such as raising the speech rate, aborting the response, or fading out the power, as described above. The response interrupt control information also includes information on the timing from which the response output is to be changed, for example from the synthesis unit following the one being output.
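The timing check performed by the interrupt control unit can be sketched as follows. The function name, the `(text, start, end)` layout of the position information and the shape of the returned control information are assumptions. The action itself is passed in, since in the text it is decided separately from the importance of the input and of the response.

```python
def handle_interrupt(position_info, interrupt_time, action="abort"):
    """Return response interrupt control information, or None if none is needed.

    position_info: list of (text, start, end) per synthesis unit, in output order.
    The change takes effect from the unit after the one being output.
    """
    if interrupt_time >= position_info[-1][2]:
        return None  # response output already complete: no control information
    # Number of units whose output has started, minus one, is the current unit.
    current = sum(1 for _, start, _ in position_info if start <= interrupt_time) - 1
    return {"action": action, "from_unit": current + 1}
```

When the interruption falls in the last unit, `from_unit` equals the number of units, meaning nothing remains to be changed and the response simply finishes.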

【０１０９】[0109]

【発明の効果】 [Effects of the Invention] As described above, in the first invention of the present application, even when the voice response is superimposed on the user's speech signal and enters through the microphone, the voice response is removed and only the speech signal is recognized. The user's speech can therefore be recognized even while a voice response is being output from the speaker, which makes an extremely smooth dialogue possible. The invention is also highly effective in multimedia systems that interact with the user while displaying visual data such as graphic information, images and animation. In addition, if adaptation is stopped when the power of the voice signal falls, the estimation accuracy of the transfer function does not degrade, and high estimation accuracy can be maintained at all times.

【０１１０】 In the second invention of the present application, the threshold used in recognizing voice input is varied according to the power of the voice response picked up by the microphone. Erroneous input can therefore be prevented, and highly accurate speech recognition becomes possible.

【０１１１】 In the third invention of the present application, when the user interrupts during voice response output, control is performed according to the input content to decide whether the voice response output is continued, aborted, or continued only partway. The system can thereby move quickly to the next response, and an advanced dialogue that reflects the input content becomes possible.

[Brief description of drawings]

【図１】 FIG. 1 is a block diagram showing the configuration of a first embodiment of a voice dialogue system to which the present invention is applied.

【図２】 FIG. 2 is a diagram showing the removal characteristic for the voice response.

【図３】 FIG. 3 is a flowchart showing the operation of the first embodiment.

【図４】 FIG. 4 is a flowchart showing the operation of determining the step gain μ(f).

【図５】 FIG. 5 is a time chart showing the changes over time in the power and pitch of the voice response.

【図６】 FIG. 6 is a block diagram showing the internal configuration of the voice response unit.

【図７】 FIG. 7 is a time chart showing the changes over time in the voice response and the user's speech signal.

【図８】 FIG. 8 is a block diagram showing the configuration of a second embodiment of a voice dialogue system to which the present invention is applied.

【図９】 FIG. 9 is a block diagram showing the configuration of a third embodiment of the present invention.

【図１０】 FIG. 10 is a characteristic diagram showing the estimation accuracy of the transfer function.

【図１１】 FIG. 11 is a characteristic diagram showing the relationship between estimation accuracy and speech recognition rate.

【図１２】 FIG. 12 is a diagram showing the external appearance of the voice dialogue system.

【図１３】 FIG. 13 is a diagram showing the output power of each smoothing filter.

【図１４】 FIG. 14 is an explanatory diagram showing the period during which adaptation is stopped.

【図１５】 FIG. 15 is a flowchart showing the operation of the third embodiment.

【図１６】 FIG. 16 is a block diagram showing the configuration of a fourth embodiment of the present invention.

【図１７】 FIG. 17 is a block diagram showing the details of the voice detection unit.

【図１８】 FIG. 18 is an explanatory diagram showing a voice signal and the threshold used in recognizing voice.

【図１９】 FIG. 19 is a state transition diagram for recognizing voice.

【図２０】 FIG. 20 is a flowchart showing the operation of changing the threshold.

【図２１】 FIG. 21 is a characteristic diagram showing the original spectrum and the spectrum after difference processing.

【図２２】 FIG. 22 is a characteristic diagram showing the transfer function estimation result when the synthesized speech "Is that all?" is input.

【図２３】 FIG. 23 is a characteristic diagram showing the relationship between the power of the synthesized speech and the cancellation performance.

【図２４】 FIG. 24 is a characteristic diagram showing the relationship between the orientation of the microphone and the cancellation performance.

【図２５】 FIG. 25 is an explanatory diagram showing the positional relationship between the microphone and the speaker.

【図２６】 FIG. 26 is a characteristic diagram showing the relationship between the distance from the microphone to the speaker and the cancellation performance.

【図２７】 FIG. 27 is a block diagram showing the configuration of a fifth embodiment of the present invention.

【図２８】 FIG. 28 is a time chart showing the timing of the voice response and the voice input.

【図２９】 FIG. 29 is a time chart showing the timing of an interrupting utterance and the response output.

【図３０】 FIG. 30 is a block diagram showing the detailed configuration of the response generation/output unit.

【図３１】 FIG. 31 is a time chart showing the response output with and without a response abort.

【図３２】 FIG. 32 is a flowchart showing the flow of the response abort control.

【図３３】 FIG. 33 is a time chart showing an example of raising the speech rate.

【図３４】 FIG. 34 is a time chart showing each signal when the prosodic variation stays the same.

【図３５】 FIG. 35 is a time chart showing each signal when the prosodic variation becomes larger.

【図３６】 FIG. 36 is a time chart for changing the power.

【図３７】 FIG. 37 is a flowchart showing the operation of prohibiting interrupt control when little of the response remains to be output.

【図３８】 FIG. 38 is a flowchart of control that does not interrupt the response when the response content being output is important.

【図３９】 FIG. 39 is a flowchart for deciding whether to permit an interruption according to the importance of the interrupting content and of the output content.

【図４０】 FIG. 40 is a flowchart of control that prohibits interruption while important content remains in the not-yet-output response.

【符号の説明】 [Explanation of symbols]
1 Microphone
2 Voice response removal unit
3 Adaptive filter
3a Lookup table
4 Subtractor
5 Speech recognition unit
7 Voice response unit
8 Speaker
10 Speech synthesis unit
11 Music synthesis unit
15 Transfer function update control unit
31 A/D converter
32 A/D converter
33 First smoothing filter
34 Second smoothing filter
35 Adaptation/stop switching unit
37 Voice detection unit
38 Detection threshold determination unit
39 Voice judgment unit
40 Impulse response estimation unit
41 Input recognition and understanding unit
42 Dialogue management unit
43 Response generation/output unit
44 Interrupt control unit

Continuation of the front page
(51) Int.Cl.⁷, identification code, FI: G10L 15/28, 21/02
(72) Inventor: 瀬戸重宣, c/o Toshiba Corp. Research and Development Center, 1 Komukai Toshiba-cho, Saiwai-ku, Kawasaki-shi, Kanagawa
(72) Inventor: 山下泰樹, c/o Toshiba Corp. Kansai System Center, 8-6-26 Motoyama-cho, Higashinada-ku, Kobe-shi, Hyogo
(56) References: JP H02-250099 A; JP S59-195739 A; JP S60-216392 A; JP S61-262798 A; JP S60-114900 A; JP H02-83600 U
(58) Fields searched (Int.Cl.⁷, DB name): G10L 13/00; G10L 21/02

Claims

(57) [Claims]

1. A voice recognition method in which voice input from input means such as a microphone is recognized and a predetermined voice response based on the recognition result is output from output means such as a speaker to carry out a dialogue, wherein a transfer function, which is the frequency spectrum of the impulse response of the voice response output from the output means, is estimated using a response generation parameter for synthesizing the voice response; the estimated value of the transfer function used in estimating the impulse response is stored; when voice is detected from the input means, it is determined whether to estimate the impulse response from the stored past estimated value or to update the estimated value and estimate the impulse response; the voice response is corrected by the impulse response estimated as a result of the determination; only the corrected voice response is canceled from the input voice; and the voice remaining after the voice response has been canceled is recognized.

2. A voice recognition method in which voice input from input means such as a microphone is recognized and a predetermined voice response based on the recognition result is output from output means such as a speaker to carry out a dialogue, wherein a transfer function, which is the frequency spectrum of the impulse response of the voice response output from the output means, is estimated using a response generation parameter for synthesizing the voice response; the voice signal power of the voice response is smoothed; the estimated value of the transfer function is not updated when it is determined, based on the smoothed voice signal, that no voice response is being output; the voice response is corrected by the estimated impulse response; only the corrected voice response is canceled from the input voice; and the voice remaining after the voice response has been canceled is recognized.

3. The voice recognition method according to claim 1, wherein the voice signal power of the voice response is smoothed, and the estimated value of the transfer function is not updated when it is determined, based on the smoothed voice signal, that no voice response is being output.

4. A voice interaction device comprising: means for obtaining the background noise power in the absence of voice input; means for obtaining the synthesized sound power in the microphone signal, based on an impulse response, while synthesized speech is being output; means for determining whether or not there is voice input based on the duration for which the voice input power exceeds a threshold, the sum of the background noise power and the synthesized sound power being used as the voice input power detection threshold; and means for performing voice recognition only when it is determined that there is voice input.
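
The update gating in claims 2 and 3 and the detection rule in claim 4 can be illustrated with a minimal Python sketch. It is not the patented implementation: it adapts a time-domain impulse response with a normalized LMS (NLMS) update, which is swapped in here for simplicity, whereas the claims estimate a transfer function from response generation parameters. The function names (`smooth_power`, `cancel_response`, `detect_voice`) and the parameters `alpha`, `taps`, `mu`, `power_floor` and `min_frames` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def smooth_power(signal: Sequence[float], alpha: float = 0.9) -> List[float]:
    """Recursively smooth the instantaneous power x[n]**2 (first-order low-pass)."""
    state = 0.0
    smoothed = []
    for x in signal:
        state = alpha * state + (1.0 - alpha) * x * x
        smoothed.append(state)
    return smoothed


def cancel_response(
    mic: Sequence[float],
    response: Sequence[float],
    taps: int = 8,
    mu: float = 0.5,
    power_floor: float = 1e-4,
) -> Tuple[List[float], List[float]]:
    """Subtract the estimated echo of `response` from `mic`.

    NLMS is an assumption of this sketch, not the patent's estimator.
    The impulse-response estimate `h` is updated only while the smoothed
    response power is at or above `power_floor` (assumed threshold);
    otherwise the stored estimate is reused unchanged.
    """
    h = [0.0] * taps
    recent = [0.0] * taps  # latest response samples, newest first
    power = smooth_power(response)
    residual = []
    for n, (d, x) in enumerate(zip(mic, response)):
        recent = [x] + recent[:-1]
        echo = sum(hk * xk for hk, xk in zip(h, recent))
        e = d - echo  # microphone signal with the corrected response removed
        residual.append(e)
        if power[n] >= power_floor:  # response judged to be playing: adapt
            norm = sum(xk * xk for xk in recent) + 1e-8
            h = [hk + mu * e * xk / norm for hk, xk in zip(h, recent)]
    return residual, h


def detect_voice(
    input_power: Sequence[float],
    noise_power: float,
    response_power: Sequence[float],
    min_frames: int = 3,
) -> List[bool]:
    """Flag voice input once the input power has exceeded
    (background noise power + synthesized sound power) for `min_frames`
    consecutive frames."""
    flags = []
    run = 0
    for p, r in zip(input_power, response_power):
        run = run + 1 if p > noise_power + r else 0
        flags.append(run >= min_frames)
    return flags
```

In use, `cancel_response` would be fed the microphone samples together with the samples sent to the speaker, and the returned residual would be passed on for recognition only in frames where `detect_voice` returns `True`. The duration requirement keeps short power spikes from being treated as speech.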