JP2008233305A

JP2008233305A - Voice dialogue apparatus, voice dialogue method and program

Info

Publication number: JP2008233305A
Application number: JP2007070111A
Authority: JP
Inventors: Kinichi Wada; 錦一和田; Iko Terasawa; 位好寺澤; Hiroyuki Hoshino; 博之星野; Hiroaki Sekiyama; 博昭関山; Toshiyuki Nanba; 利行難波
Original assignee: Toyota Motor Corp; Toyota Central R&D Labs Inc
Current assignee: Toyota Motor Corp; Toyota Central R&D Labs Inc
Priority date: 2007-03-19
Filing date: 2007-03-19
Publication date: 2008-10-02

Abstract

PROBLEM TO BE SOLVED: To provide a voice dialogue apparatus that improves the user's subjective evaluation while neither continuing an erroneous dialogue nor making unnecessary responses.
SOLUTION: A plurality of combination conditions are set for a first threshold, which separates acceptance from confirmation, and a second threshold, which separates confirmation from rejection. An evaluation experiment is conducted on a voice dialogue apparatus 1 that has speech recognition means 23 for converting input speech data into character data, reliability calculation means 25 for calculating the reliability of each word contained in the character data, and response control means 27 for comparing the reliability with the first and second thresholds and controlling which response (acceptance, confirmation, or rejection) is made. The most preferable of the combination conditions is then determined from the subjects' subjective evaluation results, in particular the results for accuracy and efficiency.
[Selected drawing] FIG. 2

Description

The present invention relates to speech recognition technology, and more particularly to a voice dialogue apparatus, a voice dialogue method, and a program that perform response control based on a speech recognition result.

In recent years, speech recognition technology has come into wide use, for example in car navigation systems, mobile phones, games, and call centers. Various dictation programs are also available as PC software.
As a mechanism for performing response control based on a speech recognition result, techniques that use the reliability of the recognized words have been disclosed (see Patent Document 1 and Non-Patent Document 1).

The method described in Patent Document 1 realizes a voice dialogue apparatus that uses the reliability of the recognized words to control whether a speech recognition result is accepted or confirmed. A threshold controlling acceptance/confirmation is held for each semantic category and is corrected from the dialogue history data while the apparatus is in operation: the threshold is raised when a misrecognized word has been accepted, and lowered when a correctly recognized word has been confirmed.

The method described in Non-Patent Document 1 realizes a voice dialogue apparatus that uses the reliability of the recognized words to control whether a speech recognition result is accepted, confirmed, or rejected. Threshold 1, which controls acceptance/confirmation, and threshold 2, which controls confirmation/rejection, are determined using several error rates that serve as objective indices. Two error rates are used to determine threshold 1:
FA1 = 1 - (number of accepted correct words) / (total number of accepted words)
SErr = 1 - (number of accepted correct words) / (total number of correct words, which equals the total number of evaluation words)
Two error rates are likewise used to determine threshold 2:
FA2 = 1 - (number of confirmed correct words) / (total number of confirmed words)
FR = (number of rejected correct words) / (total number of rejected words)
Patent Document 1: JP 2005-181386 A
Non-Patent Document 1: Tatsuya Kawahara and Kazunori Komatani, "Utilization of Reliability of Speech Recognition Results in Spoken Dialogue Systems", Proceedings of the Acoustical Society of Japan 2000 Autumn Meeting, Vol. I, p. 73, 2000.
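The four error rates FA1, SErr, FA2, and FR described for Non-Patent Document 1 can be illustrated with a short Python sketch. The `WordResult` record, the action labels, and the `error_rates` function are hypothetical names introduced here for illustration; the sketch also assumes that each evaluated word has exactly one reference word, so that the SErr denominator is the number of evaluated words.

```python
from dataclasses import dataclass


@dataclass
class WordResult:
    correct: bool  # True if the recognized word matches the reference word
    action: str    # response taken: "accept", "confirm", or "reject"


def error_rates(results):
    """Return FA1, SErr, FA2, and FR; a rate is None when its denominator is zero."""
    def correct_fraction(correct_count, total):
        return None if total == 0 else correct_count / total

    def complement(fraction):
        return None if fraction is None else 1.0 - fraction

    def correct_in(action):
        return sum(1 for r in results if r.action == action and r.correct)

    def total_in(action):
        return sum(1 for r in results if r.action == action)

    accepted_correct = correct_in("accept")
    return {
        "FA1": complement(correct_fraction(accepted_correct, total_in("accept"))),
        # Assumption: one reference word per evaluated word,
        # so the SErr denominator is len(results).
        "SErr": complement(correct_fraction(accepted_correct, len(results))),
        "FA2": complement(correct_fraction(correct_in("confirm"), total_in("confirm"))),
        "FR": correct_fraction(correct_in("reject"), total_in("reject")),
    }
```

Sweeping candidate thresholds and recomputing these rates for each setting is one way such objective indices could be compared, although the source does not describe the search procedure itself.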

However, Patent Document 1 provides no response control for confirmation/rejection. If a confirmation response is made for a misrecognized result, the user will always answer "no" and must then speak again, so the dialogue is not efficient. When the reliability is sufficiently low, misrecognition is very likely, and response control should therefore reject the result.

In Non-Patent Document 1, the thresholds are determined as optimal values in terms of objective evaluation only, and subjective evaluation does not necessarily improve.

The present invention has been made in view of the problems described above, and its object is to provide a voice dialogue apparatus that improves the user's subjective evaluation while neither continuing an erroneous dialogue nor making unnecessary responses.

To achieve this object, a first invention is a voice dialogue apparatus comprising: speech recognition means for converting input speech data into character data; reliability calculation means for calculating the reliability of a word contained in the character data; and response control means for comparing the reliability with a first threshold, which separates acceptance from confirmation, and a second threshold, which separates confirmation from rejection, and controlling which response (acceptance, confirmation, or rejection) is made, wherein the first threshold and the second threshold are determined based on subjective evaluation.
The subjective evaluation is desirably an evaluation of accuracy and efficiency.

A second invention is a voice dialogue method including: a step of converting input speech data into character data; a step of calculating the reliability of a word contained in the character data; and a step of comparing the reliability with a first threshold, which separates acceptance from confirmation, and a second threshold, which separates confirmation from rejection, and controlling which response (acceptance, confirmation, or rejection) is made, wherein the first threshold and the second threshold are determined based on subjective evaluation.
The subjective evaluation is desirably an evaluation of accuracy and efficiency.

A third invention is a program that causes a computer to function as the voice dialogue apparatus according to claim 1 or claim 2.

According to the present invention, it is possible to provide a voice dialogue apparatus that improves the user's subjective evaluation while neither continuing an erroneous dialogue nor making unnecessary responses.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a hardware configuration diagram of a computer that realizes the voice dialogue apparatus 1 according to the present embodiment.
As shown in FIG. 1, the voice dialogue apparatus 1 has a control unit 3, a storage unit 5, a media input/output unit 7, a communication control unit 9, an input unit 11, a display unit 13, a peripheral device I/F unit 15, and the like, which are connected via a bus 17.
The following embodiment shows an example of the voice dialogue apparatus 1 that uses a computer as its hardware. The invention is not limited to computers, however, and can also be applied to various electronic devices such as car navigation devices, mobile phone terminals, and game devices.

The control unit 3 consists of a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like.

The CPU loads programs stored in the storage unit 5, the ROM, a recording medium, or the like into a work memory area on the RAM and executes them, drives and controls the devices connected via the bus 17, and thereby realizes the information retrieval process (see FIG. 4, described later) performed by the voice dialogue apparatus 1.
The ROM is a non-volatile memory that permanently holds the computer's boot program, programs such as the BIOS, data, and the like.
The RAM is a volatile memory that temporarily holds programs, data, and the like loaded from the storage unit 5, the ROM, a recording medium, or the like, and provides a work area that the control unit 3 uses for its various processes.

The storage unit 5 is an HDD (hard disk drive) and stores the programs executed by the control unit 3, the data needed to execute them, the OS (operating system), and the like. The stored programs include a control program corresponding to the OS and an application program corresponding to the information retrieval process described later.
The control unit 3 reads each of these program codes as needed and transfers it to the RAM, where the CPU reads and executes it as the various means.

The media input/output unit 7 (drive device) inputs and outputs data and has media input/output devices such as a floppy (registered trademark) disk drive, a PD drive, a CD drive (-ROM, -R, -RW, etc.), a DVD drive (-ROM, -R, -RW, etc.), and an MO drive.

The communication control unit 9 has a communication control device, a communication port, and the like. It is a communication interface that mediates communication between the computer and a network 19, and it controls communication with other computers via the network 19.

The input unit 11 inputs data and has input devices such as a keyboard, a pointing device such as a mouse, and a numeric keypad. It also has a voice input device such as a microphone. Operation instructions, action instructions, data, and the like can be given to the computer via the input unit 11.

The display unit 13 has a display device such as a CRT monitor or a liquid crystal panel, and a logic circuit (such as a video adapter) that works with the display device to realize the computer's video function.

The peripheral device I/F (interface) unit 15 is a port for connecting peripheral devices to the computer, and the computer exchanges data with peripheral devices through it. The peripheral device I/F unit 15 is implemented with USB, IEEE 1394, RS-232C, or the like, and usually has a plurality of peripheral device I/Fs. The connection to a peripheral device may be wired or wireless.

The bus 17 is a path that mediates the exchange of control signals, data signals, and the like between the devices.

Next, the configuration of the voice dialogue apparatus 1 is described with reference to FIG. 2.
FIG. 2 is a block diagram showing the functions of the voice dialogue apparatus 1.

The voice dialogue apparatus 1 has voice input means 21, speech recognition means 23, reliability calculation means 25, response control means 27, and the like.

The voice input means 21 inputs the speech uttered by the user as data. The speech data may be input via the input unit 11 of the voice dialogue apparatus 1, or from another computer or the like via the network 19.

The speech recognition means 23 converts the input speech data into character data. First, speech features are extracted from the waveform of the input speech data. Next, with the extracted features as input, likelihoods are calculated using an acoustic model, which defines acoustic characteristics such as speaker characteristics and the speech input environment, and a language model, which defines linguistic characteristics such as phrasing and other sentence expressions and the words to be recognized. Recognition candidates with high likelihood are then selected and converted into character data. A commonly used approach is the n-best method, which computes all possibilities for the input sentence obtained from speech analysis, obtains up to n sentence candidates, and outputs them as the recognition result in descending order of likelihood.

The reliability calculation means 25 calculates the reliability of each word contained in the character data. Here, the reliability (Confidence Measure: CM) is a measure of how far a word contained in the character data, that is, in the speech recognition result, can be trusted. A high reliability value indicates that no other candidates competing with the word were found; a low value indicates that many other candidates were competing.

As an example of a reliability calculation formula, a formula that uses the n sentence candidates output by n-best speech recognition is described. Intuitively, a word that appears consistently across the sentence candidates is regarded as reliable. Let g(i) be the log-likelihood of the i-th candidate; the reliability CM(w) of a word w is then obtained by the following formula.

Here, δ(w,i) = 1 when the word w is contained in the i-th candidate and δ(w,i) = 0 otherwise, and α is a smoothing coefficient.
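The formula itself is not reproduced in this text. One form that is consistent with the stated definitions, and that is assumed here rather than taken from the source, is CM(w) = Σ_i δ(w,i)·exp(α·g(i)) / Σ_j exp(α·g(j)), that is, the sum of the normalized weights of the candidates that contain w. A minimal Python sketch under that assumption follows; the function name `word_confidences` and the input layout are hypothetical.

```python
import math


def word_confidences(nbest, alpha=1.0):
    """Compute CM(w) for every word in an n-best list.

    nbest: list of (words, log_likelihood) pairs, one per sentence candidate.
    alpha: smoothing coefficient applied to the log-likelihoods.
    """
    if not nbest:
        return {}
    # Subtract the maximum log-likelihood before exponentiating for numerical
    # stability; the normalized weights are unchanged by this shift.
    g_max = max(g for _, g in nbest)
    weights = [math.exp(alpha * (g - g_max)) for _, g in nbest]
    total = sum(weights)

    cm = {}
    for (words, _), weight in zip(nbest, weights):
        # delta(w, i) is 0 or 1, so each word counts once per candidate.
        for word in set(words):
            cm[word] = cm.get(word, 0.0) + weight / total
    return cm
```

Under this form, a word that appears in every candidate receives a reliability of 1, and a word that appears only in low-likelihood candidates receives a value near 0, which matches the intuition stated in the text.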

The response control means 27 compares the reliability with the first threshold, which separates acceptance from confirmation, and the second threshold, which separates confirmation from rejection, and controls which response (acceptance, confirmation, or rejection) is made.

FIG. 3 is a diagram showing response control using the reliability.
As shown in FIG. 3, when the reliability value is greater than the first threshold 29, the control unit 3 makes an acceptance response. An acceptance response accepts the speech recognition result as it is and continues the dialogue. When the recognition result is correct, this saves the confirmation response described next. When the reliability value is equal to or less than the first threshold 29 and greater than the second threshold 31, the control unit 3 makes a confirmation response. A confirmation response asks the user whether the speech recognition result is correct. When the recognition result is wrong, this avoids continuing the dialogue on a wrong result. When the reliability value is equal to or less than the second threshold 31, the control unit 3 makes a rejection response. A rejection response discards the speech recognition result and asks the user the same question again. When the recognition result is wrong, this saves a confirmation response.
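The three-way decision described for FIG. 3 can be sketched in Python as follows. The names `Response` and `select_response` are hypothetical, and the check that the second threshold does not exceed the first is an added assumption; the comparisons themselves follow the text (strictly greater than a threshold moves to the higher band).

```python
from enum import Enum


class Response(Enum):
    ACCEPT = "accept"    # use the recognition result and continue the dialogue
    CONFIRM = "confirm"  # ask the user whether the result is correct
    REJECT = "reject"    # discard the result and ask the same question again


def select_response(reliability, first_threshold, second_threshold):
    """Choose the response for one word from its reliability value."""
    if second_threshold > first_threshold:
        raise ValueError("second threshold must not exceed first threshold")
    if reliability > first_threshold:
        return Response.ACCEPT
    if reliability > second_threshold:
        return Response.CONFIRM
    return Response.REJECT
```

For example, with illustrative thresholds of 0.8 and 0.3 (values chosen here only for demonstration), a reliability of 0.9 is accepted, 0.5 is confirmed, and 0.3 is rejected.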

Next, the operation of the voice dialogue apparatus 1 is described in detail with reference to FIG. 4.
FIG. 4 is a flowchart showing the procedure of the voice dialogue process.

As shown in FIG. 4, when speech data is input via the input unit 11 (step 101), the control unit 3 converts the input speech data into character data (step 102).

Next, the control unit 3 calculates the reliability of each word contained in the character data (step 103). The reliability is calculated using, for example, the formula given above in the description of FIG. 2.

Next, the control unit 3 checks whether the calculated reliability is greater than the first threshold (step 104).
If the reliability is greater than the first threshold, the control unit 3 makes an acceptance response (step 105).
If the reliability is equal to or less than the first threshold, the process proceeds to step 106.

Next, the control unit 3 checks whether the calculated reliability is greater than the second threshold (step 105).
If the reliability is greater than the second threshold, the control unit 3 makes a confirmation response (step 106).
If the reliability is equal to or less than the second threshold, the control unit 3 makes a rejection response (step 107).
Treating the above as one dialogue turn and repeating it over multiple turns, the control unit 3 realizes a voice dialogue with the user.

Next, how the first threshold and the second threshold are determined is described with reference to FIGS. 5 to 9.

FIG. 5 is a diagram showing the combination conditions of the first threshold and the second threshold.
As shown in FIG. 5, for example, three combination conditions of the first and second thresholds are set, an evaluation experiment is conducted on the voice dialogue apparatus 1 having the functions shown in FIG. 2, and the first and second thresholds are determined from the subjects' subjective evaluation results.
The number of combination conditions is not limited to three; three or more combination conditions may be set.

FIG. 6 is a diagram showing the response behavior for each condition shown in FIG. 5.
As shown in FIG. 6, condition A sets a wide range for confirmation responses, condition B sets a wide range for rejection responses, and condition C sets a wide range for acceptance responses. The response behavior thus differs greatly among conditions A to C.
The evaluation experiment that was actually conducted is described below.

First, the specifications of the evaluation experiment are described.
The speech recognition engine that realizes the speech recognition means 23 shown in FIG. 2 is Julius 3.5, which is publicly available open source software.
The acoustic model is a speaker-independent PTM (Phonetic Tied-Mixture) triphone model that is publicly released together with Julius 3.5.
The language model has a vocabulary of 3,500 words and was trained on a list of 3 million recognition sentences.
A voice dialogue apparatus 1 providing a facility search service by voice dialogue was built with these modules, and the evaluation experiment was conducted on it. The subjects were six men and women in their 20s to 40s who were experienced in voice dialogue processing.

Next, the results of the subjective evaluation in the experiment are described.
FIG. 7 is a diagram showing the subjects' subjective evaluation.
As shown in FIG. 7, the subjects rated the apparatus from two viewpoints, accuracy and efficiency. The scores shown in FIG. 7 are the averages of the subjects' ratings on a five-point scale. For accuracy, condition A received the highest rating. For efficiency, condition C received the highest rating, but condition A was rated almost as high.
The condition is determined as follows. First, for example, condition C, which has the lowest accuracy rating, is excluded in order to secure accuracy. Next, for example, of the remaining conditions A and B, condition B, which has the lower efficiency rating, is excluded in order to improve efficiency. Among the three conditions, condition A can therefore be determined to be the most preferable combination of the first and second thresholds.
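The elimination procedure described for FIG. 7 can be sketched in Python as follows. The function name `select_condition` and the dictionary layout are hypothetical. The text describes the procedure only for three conditions; picking the highest efficiency rating among all remaining conditions is an assumed generalization that gives the same result in the three-condition case.

```python
def select_condition(scores):
    """Pick a threshold condition from subjective ratings.

    scores: {condition_name: {"accuracy": rating, "efficiency": rating}}
    Step 1: exclude the condition with the lowest accuracy rating.
    Step 2: among the rest, choose the highest efficiency rating.
    """
    if len(scores) < 2:
        raise ValueError("at least two conditions are required")
    lowest_accuracy = min(scores, key=lambda name: scores[name]["accuracy"])
    remaining = {name: s for name, s in scores.items() if name != lowest_accuracy}
    return max(remaining, key=lambda name: remaining[name]["efficiency"])


# Hypothetical ratings that only reproduce the ordering described in the text;
# the actual values appear in FIG. 7 and are not given here.
example_scores = {
    "A": {"accuracy": 4.2, "efficiency": 3.9},
    "B": {"accuracy": 3.8, "efficiency": 3.2},
    "C": {"accuracy": 3.0, "efficiency": 4.0},
}
chosen = select_condition(example_scores)
```

With these illustrative ratings, condition C is excluded for accuracy and condition A is chosen over condition B for efficiency.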

Next, how condition A, determined to be the most preferable combination of the first and second thresholds, fared in the subjects' overall evaluation is described.
FIG. 8 is a diagram showing, for each condition shown in FIG. 5, the result of a comparison with human-to-human dialogue.
The scores shown in FIG. 8 are relative scores, assuming that using a facility search service through voice dialogue with a human operator scores 100 points.
As shown in FIG. 8, condition A scored about 5 points higher than the other conditions.

Next, the results of the objective evaluation in the experiment are described.
FIG. 9 is a diagram showing the acceptance error rate and the average required time for each condition shown in FIG. 5.
The horizontal axis of FIG. 9 is the acceptance error rate, the rate at which misrecognized words were erroneously accepted (= 1 - number of accepted correct words / total number of accepted words). The vertical axis of FIG. 9 is the average required time. The average required time is the average time from the start of the voice dialogue until all search keywords have been input; it does not include the time from when the system executes the search until it displays the results. The acceptance error rate is an objective index of accuracy, and the average required time is an objective index of efficiency. If no acceptance errors occur, no erroneous dialogue is being continued. If the average required time is not extremely long, no unnecessary responses are being made.
As shown in FIG. 9, acceptance errors occur under condition C, so accuracy is not secured in terms of objective evaluation. This supports the decision to exclude condition C based on the subjective accuracy ratings.
Under conditions A and B, on the other hand, no acceptance errors occur, and accuracy is secured in terms of objective evaluation. Furthermore, the average required time under conditions A and B differs little from that under condition C, which shows that no unnecessary responses are being made.

As described in detail above, according to the present embodiment, an evaluation experiment is conducted on the voice dialogue apparatus 1 having the functions shown in FIG. 2, and the first and second thresholds are determined from the subjects' subjective evaluation results, in particular the results for accuracy and efficiency.

In the description of FIG. 7, the first and second thresholds were determined from the subjective evaluation results alone, but they may instead be determined from both the subjective and the objective evaluation results. For example, when 100% accuracy must be secured in terms of objective evaluation, the combination of the first and second thresholds may be chosen, from among the combinations for which the acceptance error rate shown in FIG. 9 is 0%, as the one with the highest subjective efficiency rating shown in FIG. 7.
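The combined use of objective and subjective results described above can be sketched in Python as follows. The function name `select_with_objective_constraint`, the field names, and the numeric values are hypothetical; the values only reproduce the qualitative pattern described for FIGS. 7 and 9 (acceptance errors under condition C, none under A and B).

```python
def select_with_objective_constraint(conditions):
    """Choose a condition using an objective filter and a subjective ranking.

    conditions: {name: {"acceptance_error_rate": rate, "efficiency": rating}}
    Keeps only conditions whose acceptance error rate is exactly 0, then
    returns the one with the highest subjective efficiency rating, or None
    if no condition passes the filter.
    """
    eligible = {
        name: c for name, c in conditions.items()
        if c["acceptance_error_rate"] == 0.0
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name]["efficiency"])


# Hypothetical measurements for illustration only.
example_conditions = {
    "A": {"acceptance_error_rate": 0.0, "efficiency": 3.9},
    "B": {"acceptance_error_rate": 0.0, "efficiency": 3.2},
    "C": {"acceptance_error_rate": 0.05, "efficiency": 4.0},
}
chosen_condition = select_with_objective_constraint(example_conditions)
```

Condition C is filtered out despite its high efficiency rating because it fails the objective accuracy requirement, leaving condition A as the choice in this illustration.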

The present embodiment provides a voice dialogue apparatus 1 that improves the user's subjective evaluation. It also avoids continuing an erroneous dialogue while avoiding unnecessary responses. As a result, users feel satisfied with the various services provided through the voice dialogue apparatus 1 and can continue to use them with confidence.

Preferred embodiments of the voice dialogue apparatus and the like according to the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to these examples. It is clear that a person skilled in the art could arrive at various changes or modifications within the scope of the technical idea disclosed in this application, and these are naturally understood to belong to the technical scope of the present invention.

FIG. 1: Hardware configuration diagram of a computer that realizes the voice dialogue apparatus 1
FIG. 2: Block diagram showing the functions of the voice dialogue apparatus 1
FIG. 3: Diagram showing response control using the reliability
FIG. 4: Flowchart showing the procedure of the voice dialogue process
FIG. 5: Diagram showing the combination conditions of the first threshold and the second threshold
FIG. 6: Diagram showing the response behavior for each condition shown in FIG. 5
FIG. 7: Diagram showing the subjects' subjective evaluation
FIG. 8: Diagram showing, for each condition shown in FIG. 5, the result of a comparison with human-to-human dialogue
FIG. 9: Diagram showing the acceptance error rate and the average required time for each condition shown in FIG. 5

Explanation of symbols

1: Voice dialogue apparatus
3: Control unit
5: Storage unit
7: Media input/output unit
9: Communication control unit
11: Input unit
13: Display unit
15: Peripheral device I/F unit
17: Bus
19: Network
21: Voice input means
23: Speech recognition means
25: Reliability calculation means
27: Response control means

Claims

1. A voice dialogue apparatus comprising:
speech recognition means for converting input speech data into character data;
reliability calculation means for calculating the reliability of a word contained in the character data; and
response control means for comparing the reliability with a first threshold, which separates acceptance from confirmation, and a second threshold, which separates confirmation from rejection, and controlling which response (acceptance, confirmation, or rejection) is made,
wherein the first threshold and the second threshold are determined based on subjective evaluation.

2. The voice dialogue apparatus according to claim 1, wherein the subjective evaluation is an evaluation of accuracy and efficiency.

3. A voice dialogue method including:
a step of converting input speech data into character data;
a step of calculating the reliability of a word contained in the character data; and
a step of comparing the reliability with a first threshold, which separates acceptance from confirmation, and a second threshold, which separates confirmation from rejection, and controlling which response (acceptance, confirmation, or rejection) is made,
wherein the first threshold and the second threshold are determined based on subjective evaluation.

4. The voice dialogue method according to claim 3, wherein the subjective evaluation is an evaluation of accuracy and efficiency.

5. A program for causing a computer to function as the voice dialogue apparatus according to claim 1 or claim 2.