JP3788793B2

JP3788793B2 - Voice dialogue control method, voice dialogue control device, voice dialogue control program

Info

Publication number: JP3788793B2
Application number: JP2003121290A
Authority: JP
Inventors: 純一平澤; 俊一郎山本; 翼篠崎; 毅文山崎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-04-25
Filing date: 2003-04-25
Publication date: 2006-06-21
Anticipated expiration: 2023-04-25
Also published as: JP2004325848A

Description

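[0001]
[Technical Field of the Invention]
The present invention relates to a voice dialogue control method, a voice dialogue control device, and a voice dialogue control program that can be used in various reservation systems and the like. In particular, it aims to provide a voice dialogue control method, device, and program that can carry out dialogue processing smoothly even when a voice input from the user interrupts a voice output from the system.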
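[0002]
[Prior Art]
Various voice dialogue devices have been proposed, for example in Patent Document 1. Patent Document 1 proposes a control method (Patent Document 1, Figs. 1, 5, and 7) in which, when the user speaks during a voice output from the system and an interruption occurs, the system temporarily suspends the voice output and decides, according to the content of the user's interrupting utterance, whether to resume or terminate the output.
This control is briefly described with reference to Fig. 13. At time T₀, playback of the system's voice output SYS-1 begins. Normally, SYS-1 would be played continuously until time Tn. Here, however, the user utters voice USR-2 at time T₁. The method measures the duration of the interrupt voice input USR-2; if the duration is longer than a certain threshold, playback is stopped, and if it is shorter than the threshold, playback is resumed and the remainder of SYS-1 is played.
In addition, claim 4 of Patent Document 1 proposes a method that analyzes the content of the interrupt voice input USR-2, resumes the remainder of the voice output SYS-1 if the content is affirmative, and stops the voice output if it is negative.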
[0003]
[Patent Document 1]
JP-A-8-146991
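[0004]
[Problems to Be Solved by the Invention]
In Patent Document 1, when an interrupt voice input from the user occurs during a voice output from the system, control is a binary choice: the system either resumes or stops its voice output according to the content and duration of the interrupt voice input. Under this binary control, if the user makes a short interrupting utterance such as "Huh!" while the system is playing, for example, a description of a facility, playback is suspended by that utterance and then resumed. If the user makes a relatively long interrupting utterance such as "Please stop." or "Thank you", playback is stopped, so the description can be ended partway through.
This prior-art technique is intended for cases where the system voice output continues for a relatively long time, as in an explanation or presentation to the user, and its purpose is to determine whether the user wants to stop the system voice output or wants to keep listening to it.
However, in a dialogue system in which the turn to speak alternates frequently between system questions and user responses, and which accepts interrupt voice input from the user at all times, the utterances become confused and the dialogue no longer proceeds smoothly unless the system responds in a more fine-grained way according to the content and length of the interrupt voice input and the position at which the interruption occurred.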
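[0005]
For example, when an interrupting utterance from the user is input at almost the same time as the system begins asking a question, within a few hundred milliseconds, the issue is less whether the user wants to stop the system voice output than that the user very likely was about to speak before starting to hear that output. Therefore, for a system voice output that has been paused by a user interruption, the choice between continuing and ending playback is insufficient; it is important to determine which system voice output the user's interrupt voice input is a response to.
In addition, when the start of a system voice output and an interrupt voice input from the user nearly coincide, users tend to be startled that the system has started speaking and to break off their own input partway, even though they had begun to speak. If the recognition result is used as is simply because an interrupt voice input occurred, the recognition result for such a truncated, fragmentary utterance often lacks sufficient content and can instead confuse the dialogue. It is therefore necessary to judge an interrupt voice input comprehensively from the length of the input speech, the recognition likelihood, the content of the recognition result, and so on, and to determine whether it should be treated as a fragmentary utterance whose recognition result is better left unused, or as a non-fragmentary utterance whose recognition result may be used.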
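[0006]
With reference to Fig. 14, an example is described in which the system should change its next voice output according to differences in the user's interrupt voice input. Suppose the system asks for the boarding date and time with voice output SYS-1, and the user responds "March 11" with voice input USR-2. If, immediately after the start of voice output SYS-3, which asks for the name of the boarding station as the next input item, the user says "Tokyo Station" as interrupt voice input USR-4, the system can stop the rest of SYS-3, take the station name "Tokyo Station" as the input value, and fix the input value for SYS-3.
However, as shown in Fig. 14C, suppose that after the user inputs "March 11" with voice input USR-2' in response to voice output SYS-1, the user says "Oh, actually, March 10" as interrupt voice input USR-4' partway through SYS-3. The system merely stops SYS-3, so "March 10" is input against the station-name item. If the system is waiting for recognition on the assumption that the user will input a boarding station, USR-4' does not mention a boarding station, so the system cannot fix an input value for SYS-3. It then has to ask again with voice output SYS-5, such as "The name of your boarding station...", prompting for the station name, which easily leads to a situation where the dialogue between the user and the system fails to mesh.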
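[0007]
An object of the present invention is to propose a voice dialogue control method, a voice dialogue control device, and a voice dialogue control program for a voice dialogue device that, even when the user's utterance interrupts the system's voice output, analyze the content of the user's utterance, maintain the dialogue state the user intends, and allow correction even of input items whose input values have already been fixed.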
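[0008]
[Means for Solving the Problems]
The present invention proposes a voice dialogue control method that repeats the following dialogue operation: each time a voice output from the system ends, the user's response to that output is input by voice; the system recognizes the voice input and performs normal response processing, which acquires the recognition result as the input value corresponding to the voice output that ended immediately before; and after the normal response processing the system produces a voice output again. When the first voice output of the system has ended and an interrupt voice input occurs during the second voice output, the method performs three determination processes. The first determination process determines whether to continue or pause the second voice output, based on voice output stop/non-stop information set in advance in consideration of the importance of the content of the second voice output. When the first determination process determines that the second voice output is to be paused, the second determination process obtains the user's interruption position within the second voice output by measuring the time from the start of the second voice output to the start of the user's interrupt voice input; if the time of the interruption position is earlier than a preset value, it determines that the content of the second voice output has not been delivered to the user, and if it is later than the preset value, that the content has been delivered. The third determination process determines, from the recognition result, the recognition likelihood, the duration of the recognized speech, and so on, whether the interrupt voice input was a non-fragmentary utterance with sufficient content or a fragmentary utterance without sufficient content. According to the results of the first, second, and third determination processes, the method executes one of the following: a first response process that, when the first determination process determines that the second voice output is to be continued, ignores the interrupt voice input and continues the second voice output; a second response process that, when the first determination process determines that the second voice output is to be paused, temporarily suspends playback of the second voice output; a third response process that, when the second determination process determines "delivered" and the third determination process determines a non-fragmentary utterance, takes in the interrupt voice input as a response to the second voice output; a fourth response process that, when the second determination process determines "not delivered" and the third determination process determines a non-fragmentary utterance, takes in the interrupt voice input as a response to the first voice output; a fifth response process that, when the second determination process determines "delivered" and the third determination process determines a fragmentary utterance, resumes playback of the second voice output suspended by the second response process; a sixth response process that, when the second determination process determines "not delivered" and the third determination process determines a fragmentary utterance, outputs the second voice output suspended by the second response process again from the beginning; and a seventh response process that, when the second determination process determines "not delivered", the third determination process determines a fragmentary utterance, and in addition the user gave no response to the first voice output, outputs the first voice output again from the beginning.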
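[0009]
The present invention further proposes a voice dialogue control device that repeats the following dialogue operation: each time a first voice output from the system ends, the user's response to that output is input by voice; the system recognizes the voice input and performs normal response processing, which acquires the recognition result as the input value corresponding to the first voice output that ended immediately before; and after the normal response processing the system produces a second voice output. The device comprises three determination means. When the first voice output has ended and an interrupt voice input from the user occurs during the second voice output, the first determination means determines whether to continue or pause the second voice output, based on voice output stop/non-stop information set in advance in consideration of the importance of the content of the second voice output. When the first determination means determines that the second voice output is to be paused, the second determination means obtains the user's interruption position within the second voice output by measuring the time from the start of the second voice output to the start of the user's interrupt voice input; if the time of the interruption position is earlier than a preset value, it determines that the content of the second voice output has not been delivered to the user, and if it is later than the preset value, that the content has been delivered. The third determination means determines, from the recognition result, the recognition likelihood, the duration of the recognized speech, and so on, whether the interrupt voice input was a non-fragmentary utterance with sufficient content or a fragmentary utterance without sufficient content. Acting on the results of the first, second, and third determination means, the device further comprises: first response means that, when the first determination means determines that the second voice output is to be continued, ignores the interrupt voice input and continues the second voice output; second response means that, when the first determination means determines that the second voice output is to be paused, temporarily suspends playback of the second voice output; third response means that, when the second determination means determines "delivered" and the third determination means determines a non-fragmentary utterance, takes in the interrupt voice input as a response to the second voice output; fourth response means that, when the second determination means determines "not delivered" and the third determination means determines a non-fragmentary utterance, takes in the interrupt voice input as a response to the first voice output; fifth response means that, when the second determination means determines "delivered" and the third determination means determines a fragmentary utterance, resumes playback of the second voice output suspended by the second response means; sixth response means that, when the second determination means determines "not delivered" and the third determination means determines a fragmentary utterance, outputs the second voice output suspended by the second response means again from the beginning; and seventh response means that, when the second determination means determines "not delivered", the third determination means determines a fragmentary utterance, and in addition the user gave no response to the first voice output, outputs the first voice output again from the beginning.
The present invention further proposes a voice dialogue control program that is written as a computer-readable code string and causes a computer to execute the above voice dialogue control method.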
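[0010]
[Operation]
According to the present invention, the first, second, and third determination processes are performed and, according to their results, the following are executed in addition to the normal response processing: a first response process that forcibly continues playback of the voice output; a second response process that temporarily suspends the voice output; a third response process that processes the interrupt voice input as a response to the suspended voice output; a fourth response process that processes the interrupt voice input as a response to the first voice output, which ended immediately before the second voice output; a fifth response process that resumes playback of the suspended voice output; a sixth response process that plays the suspended voice output again from the beginning; and a seventh response process that plays the previously output first voice output again from the beginning. Every situation that arises when an interrupt voice input occurs can therefore be handled appropriately. As a result, the dialogue can continue without confusion and the necessary items can be input correctly.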
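[0011]
[Embodiments of the Invention]
An overview of a voice dialogue system 200 that includes a voice dialogue control device 100 according to the present invention is described with reference to Fig. 1. An input voice processing device 101 is provided on the input side of the voice dialogue control device 100, and a voice output device 102 on the output side. In the input voice processing device 101, a voice section detection unit 11 detects the start and end of the voice signal input by the user. When the start of an input voice signal is detected, a start detection signal is sent to the voice dialogue control device 100 and, if the voice dialogue system 200 is outputting a voice signal at that time, is used to determine that an interrupt voice input has occurred.
The voice data delimited by the detected start and end is input to a voice recognition unit 12, which recognizes each delimited segment of voice data and outputs the recognition result as character data, such as a word string.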
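[0012]
The recognition result output from the voice recognition unit 12 is passed to a voice understanding unit 13, which identifies the intention and semantic content the user was trying to convey. Below, this content of the user's utterance is called the "utterance semantic representation". The utterance semantic representation identified by the voice understanding unit 13 is input to a context understanding unit 14. The context understanding unit 14 computes the latest understanding state from two things: the "utterance semantic representation", which is the result of recognizing and understanding the user's utterance, and the "history information" stored inside the context understanding unit 14. The latest understanding state determined by the context understanding unit 14 is input to the voice dialogue control device 100 as the output of the input voice processing device 101.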
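As an illustration only, the update performed by the context understanding unit 14 can be sketched as merging newly understood slot values into the stored history. The slot names and the merge rule (a newer value overwrites an older one) are assumptions made for this sketch; the specification does not fix how the latest understanding state is computed.

```python
def update_understanding(history: dict, utterance: dict) -> dict:
    """Return the latest understanding state.

    history   -- slot values accumulated so far (hypothetical "history information")
    utterance -- slot values extracted from the newest utterance
                 (hypothetical "utterance semantic representation")
    """
    state = dict(history)  # copy so the stored history is not mutated
    for slot, value in utterance.items():
        if value is not None:  # slots the utterance did not mention stay unchanged
            state[slot] = value
    return state


# A correction such as "Oh, actually, March 10" overwrites the earlier date.
latest = update_understanding({"date": "March 11"}, {"date": "March 10", "station": None})
```

Under this assumed rule, a later utterance can revise an input item whose value was already fixed, which is the behavior the invention aims to allow.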
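Based on the latest understanding state passed from the input voice processing device 101, the voice dialogue control device 100 selects from dialogue scenarios prepared in advance in order to decide the data for the next voice output of the voice dialogue system 200.
The data selected from the dialogue scenarios is output as character data and input to the voice output device 102. The voice output device 102 consists of a voice synthesis unit 15 and a voice output unit 16. The voice synthesis unit 15 synthesizes the voice data to be output from the input character string. The voice output unit 16 controls the starting and stopping of the synthesized voice data; that is, it pauses and resumes the voice according to the pause command (first determination process) and the output resume command (second and third determination processes) issued by the voice dialogue control device 100.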
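[0013]
Fig. 2 shows an embodiment of the internal configuration of the voice dialogue control device 100 according to the present invention. The normal dialogue control unit 103, enclosed by the dotted line in Fig. 2, has a configuration equivalent to that of a conventional voice dialogue control device. It comprises latest understanding state receiving means 21, start/end signal receiving means 22, dialogue scenario storage means 23, dialogue scenario selection means 24, character string output means 25, normal determination means 26, normal response means 27, and input item recording means 28.
Reference numeral 104 in Fig. 2 denotes the interrupt processing unit, which is the main part of the present invention. The interrupt processing unit 104 comprises first determination means 30-1, second determination means 30-2, third determination means 30-3, first response means 31-1, second response means 31-2, third response means 31-3, fourth response means 31-4, fifth response means 31-5, sixth response means 31-6, seventh response means 31-7, utterance time measuring means 32, and voice stop/resume control means 33. The normal dialogue control unit 103 and the interrupt processing unit 104 are connected to a CPU 105 via a bus line BUS, and the CPU 105 controls the whole.
In the normal dialogue control unit 103, when the latest understanding state for the user's voice input, output from the context understanding unit 14, is input to the latest understanding state receiving means 21, the dialogue scenario selection means 24 selects a dialogue scenario from the dialogue scenario storage means 23 according to that state in order to decide the content of the next voice output. The selected dialogue scenario is output as a character string from the character string output means 25, synthesized into speech by the voice synthesis unit 15 (Fig. 1), and output as voice by the voice output unit 16.
The start/end signal receiving means 22 receives the utterance start and end of each voice input, sent from the voice section detection unit 11, as time data. It also records the utterance start and end of each voice output from the system as time data, and monitors the order of utterances in the dialogue and the occurrence of interruptions.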
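[0014]
Fig. 3 shows an example of the timing relationship between the system's voice outputs and the user's voice inputs. Each voice output and voice input is given a name, and these names are used below to describe the operation of each part. In Fig. 3, SYS-1 is called the first voice output from the system, SYS-3 the second voice output, and SYS-5 the third voice output. USR-2 and USR-4 are called the first and second voice inputs spoken by the user.
The first voice input USR-2, which responds to the first voice output SYS-1, begins at time U2start, some time after the time S1end at which SYS-1 ended. A first voice input USR-2 with this timing is processed by the normal dialogue control unit 103.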
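[0015]
That is, the normal determination means 26 determines whether the system is outputting voice at the moment the start of the user's utterance (U2start in this example) is detected. If the user's utterance begins while the system is not outputting voice, the normal response means 27 is activated and performs normal response processing. Normal response processing means interpreting the voice input currently being received (here, the first voice input USR-2) as a response to the voice output that ended immediately before (here, the first voice output SYS-1), extracting the input item from the content of USR-2, and storing it in the input item recording means 28. While the dialogue proceeds normally, this normal response processing is repeated, and the necessary input items are extracted from the voice inputs and taken into the input item recording means 28.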
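[0016]
Fig. 4 is a flowchart of the normal response processing.
Step SP0 waits for voice input from the user.
Step SP1 represents the situation in which the user has started a voice input and its start has been detected.
Step SP2 determines whether the system is in the middle of outputting voice at the time of that start. In this specification, this step is called the normal determination process. If the user's voice input occurs while the system is not outputting voice, the normal determination process judges the state to be normal and proceeds to the normal response processing step SP3.
If step SP2 determines that an interrupting utterance from the user occurred during a system voice output, processing branches to interrupt processing, which is executed by the interrupt processing unit 104.
If step SP2 judges the state to be normal, step SP3 performs normal response processing, treating the voice input (USR-2) as a response to the voice output (SYS-1) output immediately before it.
Step SP4 determines whether all input items have been entered. If any item is still missing, step SP5 updates the input item, and processing returns to step SP0 to wait.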
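The branch at step SP2 can be sketched as a comparison of timestamps. This is a minimal sketch, assuming that utterance start and end times are available in seconds; the class and function names are hypothetical and not taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class OutputInterval:
    """Start and end time (seconds) of one system voice output."""
    start: float
    end: float


def is_interruption(user_start: float, output: OutputInterval) -> bool:
    """Normal determination (step SP2): True if the user's utterance
    begins while the system voice output is still being played."""
    return output.start <= user_start < output.end


def route_utterance(user_start: float, output: OutputInterval) -> str:
    """Choose between normal response processing (SP3) and interrupt processing."""
    if is_interruption(user_start, output):
        return "interrupt_processing"
    return "normal_response"
```

For example, `route_utterance(5.0, OutputInterval(0.0, 3.0))` selects normal response processing, because the user starts speaking after the output has ended.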
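[0017]
Next, the operation of the interrupt processing unit 104 is described. When the normal determination means 26 (Fig. 2) detects an interruption, the first determination means 30-1 in the interrupt processing unit 104 is activated and executes the first determination process. The interruption corresponds to the state shown in Fig. 3 in which the user utters the second voice input USR-4 while the second voice output SYS-3 is being output. The voice input USR-4 is therefore called the interrupt voice input below.
The first determination means 30-1 determines the importance of the content of the second voice output SYS-3 that the system is outputting when the interrupt voice input USR-4 occurs. This can be determined, for example, by whether the dialogue scenario carries a flag indicating importance. That is, among the dialogue scenarios stored in the dialogue scenario storage means 23, a scenario whose content is highly important, so that the system output should continue rather than yield to a user interruption, carries a flag indicating that interruptions are not accepted. Conversely, a scenario whose content is not necessarily important, so that a user interruption should be given priority, carries no flag. If the interrupt voice input USR-4 occurs while a flagged dialogue scenario is being output as SYS-3, the present invention ignores USR-4 and outputs SYS-3 to the end. This is called the first response process (forced output, interruption ignored).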
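[0018]
On the other hand, if the interrupt voice input USR-4 occurs while a dialogue scenario without the importance flag (which marks interruptions as not allowed) is being output, the present invention respects USR-4 and first temporarily suspends the second voice output SYS-3. This suspension is called the second response process (pausing the output).
By executing the first response process when the content of the voice output is highly important, the system can make sure the user hears to the end content that must reliably be conveyed (for example, a question from the system to the user).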
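The first determination process can be sketched as a lookup of a per-scenario flag. The `Scenario` type and the field name `no_interrupt` are hypothetical; the specification only states that important scenarios carry a flag indicating that interruptions are not accepted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One dialogue scenario (hypothetical representation)."""
    text: str
    no_interrupt: bool = False  # assumed name for the importance flag


def first_determination(scenario: Scenario) -> str:
    """Return which response process applies when an interruption occurs."""
    if scenario.no_interrupt:
        return "first_response"   # keep playing, ignore the interruption
    return "second_response"      # pause the output


notice = Scenario("Your reservation will be cancelled.", no_interrupt=True)
question = Scenario("Is Shinjuku the correct area?")
```

Keeping the flag on the scenario itself means the decision needs no analysis of the interrupting speech, so it can be made as soon as the start of the user's utterance is detected.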
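[0019]
When the second response process has been executed and the second voice output SYS-3 has been suspended as shown in Fig. 3, the second determination means 30-2 and the utterance time measuring means 32 (Fig. 2) are activated. The second determination means 30-2 keeps track of the start time and duration of the utterance being output by the system and, from these measurements, calculates where within the system voice output SYS-3 the user's interrupting utterance began. From this it determines whether SYS-3 may be regarded as having been fully conveyed to the user, or whether SYS-3, although its output had started, should be regarded as not having been effectively conveyed. This determination is called the second determination process (delivered/not-delivered determination). One way to make it is to check whether the duration of the system's utterance so far is at least a preset time. The preset time may be an absolute duration up to the point where the system utterance was paused, or it may be set as a ratio, such as one half or more, or one third or more, of the full length of the utterance that was to be output.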
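The delivered/not-delivered decision can be sketched as follows. This is a sketch under stated assumptions: times are in seconds, the function and parameter names are hypothetical, and the default ratio of one half is just one of the example settings the text mentions.

```python
from typing import Optional


def second_determination(
    elapsed: float,
    total: float,
    abs_threshold: Optional[float] = None,
    ratio_threshold: float = 0.5,
) -> str:
    """Decide whether a paused system output counts as delivered.

    elapsed         -- time from output start to the pause (seconds)
    total           -- full length of the output had it not been paused
    abs_threshold   -- optional absolute duration that counts as delivered
    ratio_threshold -- fraction of the full length that counts as delivered
    """
    if abs_threshold is not None and elapsed >= abs_threshold:
        return "delivered"
    if total > 0 and elapsed / total >= ratio_threshold:
        return "delivered"
    return "not_delivered"
```

With a 4-second output, a pause after 0.3 seconds yields `"not_delivered"`, while a pause after 2.5 seconds yields `"delivered"`.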
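[0020]
Fig. 5 is a flowchart outlining the part of the program that executes the first and second determination processes.
Step SP6 executes the first determination process (forced output, interruption ignored), determining the importance of the content of the voice output being played. If step SP6 determines that the importance is high, that is, that user interruptions are not allowed, processing branches to steps SP7 and SP8, the voice output SYS-3 is forcibly played to the end, and the first response process is executed.
If step SP6 determines that the importance of the content of the second voice output SYS-3 is low, that is, that the user's interruption should be handled, processing branches to step SP9 and proceeds from SP9 to SP10.
In step SP10, the voice stop/resume control means 33 (Fig. 2) is activated and controls the voice output unit 16 (Fig. 1) to pause playback of SYS-3, which completes the second response process. Processing then proceeds to step SP11.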
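[0021]
Step SP11 executes the second determination process (delivered/not-delivered determination). It refers to the values measured by the utterance time measuring means 32, namely the time d_SYS3 from the start of the second voice output SYS-3 to its pause and the stop position S3mid of SYS-3, and determines whether the content of SYS-3 is regarded as having been conveyed to the user. This determination is made in steps SP12 and SP14.
In step SP12, if the time d_SYS3 is longer than a preset threshold, or the stop position S3mid lies in the latter half, past one half of the total duration of the second voice output, the content of SYS-3 is judged to have been conveyed to the user ("delivered"), as shown in step SP13.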
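If, on the other hand, step SP14 determines that the time d_SYS3 is shorter than the threshold, or that the stop position S3mid lay in the first half of SYS-3, processing proceeds to step SP15.
[0022]
In step SP15, the content of the second voice output SYS-3 is treated as "not delivered" to the user.
After step SP13 or SP15, processing proceeds to step SP16 and waits until the end of the interrupt voice input USR-4 is detected.
When the end of USR-4 is detected, processing proceeds to the third determination process (fragment/non-fragment determination) in step SP17, shown in Fig. 6.
The third determination process in step SP17 analyzes whether the interrupt voice input USR-4 has content, and classifies the situation into four states according to the combination of that result with the "delivered" or "not delivered" result determined in steps SP13 and SP15 of the second determination process.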
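A fragment/non-fragment decision of this kind can be sketched as a combination of simple checks on the recognition result. Everything concrete here is an assumption for illustration: the `Recognition` type, the likelihood normalized to the range 0 to 1, the threshold values, and the small filler list are hypothetical, since the specification only names recognition result, recognition likelihood, and duration as the inputs.

```python
from dataclasses import dataclass


@dataclass
class Recognition:
    """Recognition result for one interrupt voice input (hypothetical)."""
    text: str
    likelihood: float  # assumed to be normalized to 0..1
    duration: float    # seconds


FILLERS = {"uh", "oh", "eh", "huh"}  # illustrative filler words only


def third_determination(
    rec: Recognition,
    min_duration: float = 0.4,
    min_likelihood: float = 0.5,
) -> str:
    """Classify an interrupt voice input as fragment or non-fragment."""
    word = rec.text.strip().lower().rstrip("!?.")
    has_content = word != "" and word not in FILLERS
    if has_content and rec.duration >= min_duration and rec.likelihood >= min_likelihood:
        return "non_fragment"
    return "fragment"
```

An utterance such as "Yes, that's right" with high likelihood is classified as `"non_fragment"`, whereas a short "uh" or a low-likelihood result is classified as `"fragment"`.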
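[0023]
Step SP18 is the step in which the first class is selected. It is selected when the content (amount of speech) of the second voice output SYS-3 has been "delivered" to the user and the interrupt voice input USR-4 has sufficient content, that is, it is a "non-fragmentary utterance".
When step SP18 is selected, step SP22 processes the interrupt voice input USR-4 as a response to the second voice output SYS-3. That is, the recognition result of USR-4 is recorded in the input item recording means 28 (Fig. 2) as the response to SYS-3 (step SP29). This is called the third response process. It applies when SYS-3 was successfully conveyed to the user and the user's interrupt input USR-4 also has sufficient content; the system then simply makes its next response to USR-4.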
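[0024]
Step SP19 is the step in which the second class is selected. It is selected when the second voice output SYS-3 is "not delivered" to the user and the interrupt voice input USR-4 contains sufficient speech, that is, it is a "non-fragmentary utterance". In this class, the user should be considered not to have made the interrupt input USR-4 after hearing SYS-3, but to have responded to the first voice output SYS-1 output one step earlier. In other words, the user responded to SYS-1 with the first voice input USR-2 and then uttered the second voice input USR-4 to continue that response; the start time of USR-4 (U4start) merely happened to fall after the start time of SYS-3 (S3start).
Therefore, when step SP19 is selected, step SP23 processes the interrupt voice input USR-4 as a response to the first voice output SYS-1, and step SP29 records the recognition result of USR-4 in the input item recording means 28 (Fig. 2) as the response to SYS-1. This is called the fourth response process.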
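[0025]
Step SP20 is the step in which the third class is selected. It is selected when SYS-3 is "delivered" to the user and the interrupt voice input USR-4 contains little speech, that is, it is a "fragmentary utterance". In this class, the user is considered to have heard SYS-3 but not to have uttered USR-4 in order to convey anything specific; the user either started to say something and stopped, or merely gave a backchannel response. The system therefore need not respond to the second voice input USR-4 and should resume the second voice output SYS-3 that it had paused.
Therefore, when step SP20 is selected, processing proceeds to step SP24, which resumes playback of SYS-3. This is called the fifth response process.
Step SP21 is the step in which the fourth class is selected. It is selected when SYS-3 is "not delivered" to the user and USR-4 contains little speech, that is, it is a "fragmentary utterance".
When step SP21 is selected, the case is further divided into two steps, SP25 and SP26.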
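[0026]
Step SP25 adds, to the conditions of step SP21 ("SYS-3 not delivered" and "USR-4 is a fragmentary utterance"), the condition that the first voice input USR-2 contained speech (shown as USR-2≠0 in Fig. 6). In this situation the user did utter USR-2 in response to SYS-1, but because SYS-3 was "not delivered", the user did not return USR-4 after hearing SYS-3. Rather, the user began USR-4 as a continuation of the first voice input, it happened to overlap with SYS-3, and the user broke off USR-4 partway, leaving a fragmentary utterance. Both the system and the user thus started SYS-3 and USR-4 and then stopped, so the system should begin SYS-3 once more, as originally planned. When step SP25 is selected, processing therefore proceeds to step SP27, which starts playing SYS-3 from the beginning. This is called the sixth response process.
With the sixth response process, SYS-3 is conveyed to the user from the beginning, and the system waits for the user to input a response to SYS-3.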
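[0027]
If, on the other hand, the first voice input USR-2 was silent (shown as USR-2=0 in Fig. 6), step SP26 determines that the response to the first voice output SYS-1 has not yet been fixed. That is, the user was for some reason late in responding to SYS-1 and was finally about to respond. The system, however, having received no response to SYS-1, judged the response corresponding to USR-2 to be silence and was about to start SYS-3, its next utterance after a silent response, when the user's second voice input USR-4 happened to begin and overlap it. In this case too, both the system and the user started SYS-3 and USR-4 and then stopped, but the USR-4 that the user began and abandoned was probably a response to SYS-1. It is therefore desirable for the system to repeat SYS-1 once more to prompt a response to it. When step SP26 is selected, step SP28 plays SYS-1 again from the beginning and waits for the user to input a response to SYS-1. This is called the seventh response process.
After the fifth response process (resuming SYS-3), the sixth (replaying SYS-3 from the beginning), or the seventh (replaying SYS-1) has been executed, step SP30 waits for the user's voice input in response to the system utterance that was output.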
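The selection among the third to seventh response processes (steps SP18 to SP26 in Fig. 6) reduces to three boolean conditions. The following sketch returns the number of the response process; the function and parameter names are hypothetical.

```python
def select_response(delivered: bool, fragment: bool, first_input_present: bool) -> int:
    """Return which response process (3 to 7) applies after the output was paused.

    delivered           -- result of the second determination for SYS-3
    fragment            -- result of the third determination for USR-4
    first_input_present -- whether USR-2 contained speech (USR-2 != 0)
    """
    if not fragment:
        # USR-4 has content: treat it as an answer to SYS-3 or to SYS-1.
        return 3 if delivered else 4
    if delivered:
        return 5  # resume SYS-3 from where it was paused
    # SYS-3 not delivered and USR-4 fragmentary:
    return 6 if first_input_present else 7  # replay SYS-3, or replay SYS-1
```

Note that `first_input_present` only matters in the last branch; in the other three classes the earlier answer USR-2 does not affect the choice.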
Examples of the response processes described above are explained with reference to Figs. 7 to 12.
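[0028]
Fig. 7 shows a dialogue example of normal response processing. The first voice output SYS-1 says "Please enter the area you want to search." After it ends, the user answers with the first voice input USR-2: "Let me see, well, Shinjuku, please."
After the answer ends, the system outputs a confirmation query as the second voice output SYS-3: "Is Shinjuku the correct area?" After SYS-3 ends, the user answers "Yes, that's right" with the second voice input USR-4. By accepting USR-4, the system completes the question about the area asked in SYS-1, moves on to the next question, and outputs "Next, please tell me the genre" as voice output SYS-5. With no interruption occurring, voice outputs and voice inputs alternate and normal response processing is carried out.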
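[0029]
Fig. 8 shows a dialogue example of the third response process described above. The first voice output SYS-1 and the first voice input USR-2 are handled in the same way as in normal response processing. Here, the second voice input USR-4 occurs as an interruption of the second voice output SYS-3. Assuming that the interrupt voice input USR-4 begins after the midpoint of SYS-3, the second determination means 30-2 determines that the content of SYS-3 has been "delivered" to the user. Furthermore, USR-4 is the utterance "Yes, that's right", a "non-fragmentary utterance", so step SP22 shown in Fig. 6 processes USR-4 as the answer to SYS-3. As a result, after the end of USR-4, the third voice output SYS-5, which asks for the next input item, is output.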
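[0030]
Fig. 9 shows a dialogue example of the fourth response process. In this example, the interrupt voice input USR-4 occurs immediately after the start of the second voice output SYS-3. Because the interruption comes immediately after the start of SYS-3, the content of SYS-3 is judged "not delivered" to the user. The interrupt voice input USR-4, "I said Shinjuku", is a "non-fragmentary utterance", so step SP19 shown in Fig. 6 is selected and USR-4 is processed as the answer to the first voice output SYS-1. As a result, SYS-3 is treated as if it had never existed and USR-4 is treated as the answer to SYS-1, so a dialogue scenario different from the originally planned SYS-3 is selected, such as "Yes, sorry. Shinjuku, understood.", and output to the user as the third voice output SYS-5.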
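[0031]
Fig. 10 shows a dialogue example of the fifth response process. In this example, an interrupt voice input USR-4 whose recognition result lacks sufficient content occurs after playback of the second voice output SYS-3 has passed its midpoint. Since more than half of SYS-3 has been played, it is treated as "delivered" to the user. USR-4 is judged to be a "fragmentary utterance", so step SP20 shown in Fig. 6 is selected. The content of USR-4 is therefore not acted upon and SYS-3 is treated as delivered, so playback of SYS-3 resumes from the position where the interruption paused it. In this example, the resume position is set slightly before the pause position, toward the start, so that part of the already played content is repeated, which makes the output easier to follow.
Playback of SYS-3 is resumed, and after it ends the user inputs "Yes, OK" as voice input USR-5, which fixes, via SYS-3, the answer to the question asked in SYS-1.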
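The overlapping resume described for the fifth response process can be sketched as a small offset calculation. The overlap length of 0.5 seconds is an assumed value; the specification only says that the resume position is placed before the pause position so that some content is repeated.

```python
def resume_position(pause_pos: float, overlap: float = 0.5) -> float:
    """Return the playback position (seconds) at which to resume.

    pause_pos -- position within the output where playback was paused
    overlap   -- how much already played audio to repeat (assumed value)
    """
    # Step back by the overlap, but never before the start of the output.
    return max(0.0, pause_pos - overlap)
```

For a pause at 2.0 seconds this resumes at 1.5 seconds; for a pause at 0.2 seconds it resumes at the very beginning.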
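[0032]
Fig. 11 shows a dialogue example of the sixth response process. In this example, an interrupt voice input USR-4 whose recognition result lacks sufficient content occurs immediately after the start of the second voice output SYS-3. Because USR-4 begins immediately after the start of SYS-3, the content of the second voice output is judged "not delivered" to the user. USR-4 is also a "fragmentary utterance", so step SP21 shown in Fig. 6 is selected. Furthermore, the first voice input USR-2 contained speech here, so the question asked in SYS-1 is judged to have already been answered by USR-2. In this case, SYS-3 is therefore played from the beginning and the user is asked to answer SYS-3.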
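[0033]
Fig. 12 shows a dialogue example of the seventh response process. In this example, the user does not speak in the first voice input USR-2 in response to SYS-1, so the response is silence, and in addition an interrupt voice input USR-4 whose recognition result lacks sufficient content occurs immediately after the start of SYS-3.
In this case, SYS-3 was interrupted immediately after it started, so its content is judged "not delivered" to the user. USR-4 is also judged to be a "fragmentary utterance", so step SP21 shown in Fig. 6 is selected. Since, in addition, the first voice input USR-2 gave no answer, processing proceeds to step SP26, which determines that the first voice output SYS-1 has not yet been answered. SYS-1 is played once more and the user is asked to answer it.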
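[0034]
The voice dialogue control method and voice dialogue control device described above are realized by installing the voice dialogue control program according to the present invention on a computer and having the computer interpret and execute it. The program is written as a computer-readable code string and recorded on a computer-readable recording medium; it is installed on the computer either by being read from the recording medium directly or by being read from the recording medium and transferred over a communication line.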
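[0035]
[Effects of the Invention]
As described above, according to the present invention, even when the user produces an interrupt voice input during a system voice output, the circumstances of the interruption are classified into several cases and processed accordingly, and appropriate dialogue processing and input value acquisition are carried out for interruptions in every such situation. The dialogue therefore does not fall into confusion, and the user can enter the necessary items without being burdened.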
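[Brief Description of the Drawings]
[Fig. 1] Block diagram illustrating an example of a voice dialogue device to which the present invention is suitably applied.
[Fig. 2] Block diagram illustrating the configuration of the main part of the present invention.
[Fig. 3] Timing chart illustrating the operation of the present invention.
[Fig. 4] Flowchart illustrating the branch between the normal dialogue control unit shown in Fig. 2 and the interrupt processing unit that is the main part of the present invention.
[Fig. 5] Flowchart illustrating the operation of the first and second response processes proposed in the present invention.
[Fig. 6] Flowchart illustrating the operation of the third to seventh response processes proposed in the present invention.
[Fig. 7] Timing chart illustrating a dialogue example during normal dialogue processing by the normal dialogue control unit shown in Fig. 2.
[Fig. 8] Timing chart illustrating a dialogue example of the third response process proposed in the present invention.
[Fig. 9] Timing chart illustrating a dialogue example of the fourth response process proposed in the present invention.
[Fig. 10] Timing chart illustrating a dialogue example of the fifth response process proposed in the present invention.
[Fig. 11] Timing chart illustrating a dialogue example of the sixth response process proposed in the present invention.
[Fig. 12] Timing chart illustrating a dialogue example of the seventh response process proposed in the present invention.
[Fig. 13] Timing chart illustrating an example of the prior art.
[Fig. 14] Timing chart illustrating another example of the prior art.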
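[Explanation of Reference Numerals]
100 voice dialogue control device
101 input voice processing device
102 voice output device
103 normal dialogue control unit
104 interrupt processing unit
105 CPU
11 voice section detection unit
12 voice recognition unit
13 voice understanding unit
14 context understanding unit
15 voice synthesis unit
16 voice output unit
21 latest understanding state receiving means
22 start/end signal receiving means
23 dialogue scenario storage means
24 dialogue scenario selection means
25 character string output means
26 normal determination means
27 normal response means
28 input item recording means
200 voice dialogue system
30-1 first determination means
30-2 second determination means
30-3 third determination means
31-1 first response means
31-2 second response means
31-3 third response means
31-4 fourth response means
31-5 fifth response means
31-6 sixth response means
31-7 seventh response means
32 utterance time measuring means
33 voice stop/resume control means
SYS-1 first voice output
SYS-3 second voice output
USR-2 first voice input
USR-4 interrupt voice input
[0001]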
[Technical Field of the Invention]
The present invention relates to a voice dialogue control method, a voice dialogue control device, and a voice dialogue control program that can be used in various reservation systems and the like. In particular, it aims to provide a voice dialogue control method, device, and program that can carry out dialogue processing smoothly even when a voice input from the user interrupts a voice output from the system.
[0002]
[Prior Art]
Various voice dialogue devices have been proposed, for example in Patent Document 1. Patent Document 1 proposes a control method (Patent Document 1, Figs. 1, 5, and 7) in which, when the user speaks during a voice output from the system and an interruption occurs, the system temporarily suspends the voice output and decides, according to the content of the user's interrupting utterance, whether to resume or terminate the output.
This control is briefly described with reference to Fig. 13. At time T₀, playback of the system's voice output SYS-1 begins. Normally, SYS-1 would be played continuously until time Tn. Here, however, the user utters voice USR-2 at time T₁. The method measures the duration of the interrupt voice input USR-2; if the duration is longer than a certain threshold, playback is stopped, and if it is shorter than the threshold, playback is resumed and the remainder of SYS-1 is played.
In addition, claim 4 of Patent Document 1 proposes a method that analyzes the content of the interrupt voice input USR-2, resumes the remainder of the voice output SYS-1 if the content is affirmative, and stops the voice output if it is negative.
[0003]
[Patent Document 1]
JP-A-8-146991
[0004]
[Problems to Be Solved by the Invention]
In Patent Document 1, when an interrupt voice input from the user occurs during a voice output from the system, control is a binary choice: the system either resumes or stops its voice output according to the content and duration of the interrupt voice input. Under this binary control, if the user makes a short interrupting utterance such as "Huh!" while the system is playing, for example, a description of a facility, playback is suspended by that utterance and then resumed. If the user makes a relatively long interrupting utterance such as "Please stop." or "Thank you", playback is stopped, so the description can be ended partway through.
This prior-art technique is intended for cases where the system voice output continues for a relatively long time, as in an explanation or presentation to the user, and its purpose is to determine whether the user wants to stop the system voice output or wants to keep listening to it.
However, in a dialogue system in which the turn to speak alternates frequently between system questions and user responses, and which accepts interrupt voice input from the user at all times, the utterances become confused and the dialogue no longer proceeds smoothly unless the system responds in a more fine-grained way according to the content and length of the interrupt voice input and the position at which the interruption occurred.
[0005]
For example, when an interrupting utterance from the user is input at almost the same time as the system begins asking a question, within a few hundred milliseconds, the issue is less whether the user wants to stop the system voice output than that the user very likely was about to speak before starting to hear that output. Therefore, for a system voice output that has been paused by a user interruption, the choice between continuing and ending playback is insufficient; it is important to determine which system voice output the user's interrupt voice input is a response to.
In addition, when the start of a system voice output and an interrupt voice input from the user nearly coincide, users tend to be startled that the system has started speaking and to break off their own input partway, even though they had begun to speak. If the recognition result is used as is simply because an interrupt voice input occurred, the recognition result for such a truncated, fragmentary utterance often lacks sufficient content and can instead confuse the dialogue. It is therefore necessary to judge an interrupt voice input comprehensively from the length of the input speech, the recognition likelihood, the content of the recognition result, and so on, and to determine whether it should be treated as a fragmentary utterance whose recognition result is better left unused, or as a non-fragmentary utterance whose recognition result may be used.
[0006]
With reference to Fig. 14, an example is described in which the system should change its next voice output according to differences in the user's interrupt voice input. Suppose the system asks for the boarding date and time with voice output SYS-1, and the user responds "March 11" with voice input USR-2. If, immediately after the start of voice output SYS-3, which asks for the name of the boarding station as the next input item, the user says "Tokyo Station" as interrupt voice input USR-4, the system can stop the rest of SYS-3, take the station name "Tokyo Station" as the input value, and fix the input value for SYS-3.
However, as shown in Fig. 14C, suppose that after the user inputs "March 11" with voice input USR-2' in response to voice output SYS-1, the user says "Oh, actually, March 10" as interrupt voice input USR-4' partway through SYS-3. The system merely stops SYS-3, so "March 10" is input against the station-name item. If the system is waiting for recognition on the assumption that the user will input a boarding station, USR-4' does not mention a boarding station, so the system cannot fix an input value for SYS-3. It then has to ask again with voice output SYS-5, such as "The name of your boarding station...", prompting for the station name, which easily leads to a situation where the dialogue between the user and the system fails to mesh.
[0007]
An object of the present invention is to propose a voice dialogue control method, a voice dialogue control device, and a voice dialogue control program for a voice dialogue device that, even when the user's utterance interrupts the system's voice output, analyze the content of the user's utterance, maintain the dialogue state the user intends, and allow correction even of input items whose input values have already been fixed.
[0008]
[Means for Solving the Problems]
The present invention proposes a voice dialogue control method that repeats the following dialogue operation: each time a voice output from the system ends, the user's response to that output is input by voice; the system recognizes the voice input and performs normal response processing, which acquires the recognition result as the input value corresponding to the voice output that ended immediately before; and after the normal response processing the system produces a voice output again. When the first voice output of the system has ended and an interrupt voice input occurs during the second voice output, the method performs three determination processes. The first determination process determines whether to continue or pause the second voice output, based on voice output stop/non-stop information set in advance in consideration of the importance of the content of the second voice output. When the first determination process determines that the second voice output is to be paused, the second determination process obtains the user's interruption position within the second voice output by measuring the time from the start of the second voice output to the start of the user's interrupt voice input; if the time of the interruption position is earlier than a preset value, it determines that the content of the second voice output has not been delivered to the user, and if it is later than the preset value, that the content has been delivered. The third determination process determines, from the recognition result, the recognition likelihood, the duration of the recognized speech, and so on, whether the interrupt voice input was a non-fragmentary utterance with sufficient content or a fragmentary utterance without sufficient content. According to the results of the first, second, and third determination processes, the method executes one of the following: a first response process that, when the first determination process determines that the second voice output is to be continued, ignores the interrupt voice input and continues the second voice output; a second response process that, when the first determination process determines that the second voice output is to be paused, temporarily suspends playback of the second voice output; a third response process that, when the second determination process determines "delivered" and the third determination process determines a non-fragmentary utterance, takes in the interrupt voice input as a response to the second voice output; a fourth response process that, when the second determination process determines "not delivered" and the third determination process determines a non-fragmentary utterance, takes in the interrupt voice input as a response to the first voice output; a fifth response process that, when the second determination process determines "delivered" and the third determination process determines a fragmentary utterance, resumes playback of the second voice output suspended by the second response process; a sixth response process that, when the second determination process determines "not delivered" and the third determination process determines a fragmentary utterance, outputs the second voice output suspended by the second response process again from the beginning; and a seventh response process that, when the second determination process determines "not delivered", the third determination process determines a fragmentary utterance, and in addition the user gave no response to the first voice output, outputs the first voice output again from the beginning.
[0009]
The present invention further proposes a voice dialogue control device that repeats the following dialogue operation: each time a first voice output from the system ends, the user's response to that output is input by voice; the system recognizes the voice input and performs normal response processing, which acquires the recognition result as the input value corresponding to the first voice output that ended immediately before; and after the normal response processing the system produces a second voice output. The device comprises three determination means. When the first voice output has ended and an interrupt voice input from the user occurs during the second voice output, the first determination means determines whether to continue or pause the second voice output, based on voice output stop/non-stop information set in advance in consideration of the importance of the content of the second voice output. When the first determination means determines that the second voice output is to be paused, the second determination means obtains the user's interruption position within the second voice output by measuring the time from the start of the second voice output to the start of the user's interrupt voice input; if the time of the interruption position is earlier than a preset value, it determines that the content of the second voice output has not been delivered to the user, and if it is later than the preset value, that the content has been delivered. The third determination means determines, from the recognition result, the recognition likelihood, the duration of the recognized speech, and so on, whether the interrupt voice input was a non-fragmentary utterance with sufficient content or a fragmentary utterance without sufficient content. Acting on the results of the first, second, and third determination means, the device further comprises: first response means that, when the first determination means determines that the second voice output is to be continued, ignores the interrupt voice input and continues the second voice output; second response means that, when the first determination means determines that the second voice output is to be paused, temporarily suspends playback of the second voice output; third response means that, when the second determination means determines "delivered" and the third determination means determines a non-fragmentary utterance, takes in the interrupt voice input as a response to the second voice output; fourth response means that, when the second determination means determines "not delivered" and the third determination means determines a non-fragmentary utterance, takes in the interrupt voice input as a response to the first voice output; fifth response means that, when the second determination means determines "delivered" and the third determination means determines a fragmentary utterance, resumes playback of the second voice output suspended by the second response means; sixth response means that, when the second determination means determines "not delivered" and the third determination means determines a fragmentary utterance, outputs the second voice output suspended by the second response means again from the beginning; and seventh response means that, when the second determination means determines "not delivered", the third determination means determines a fragmentary utterance, and in addition the user gave no response to the first voice output, outputs the first voice output again from the beginning.
The present invention further proposes a voice dialogue control program that is written as a computer-readable code string and causes a computer to execute the above voice dialogue control method.
[0010]
Action
According to the present invention, the first, second, and third determination processes are performed and, depending on their results, the following processes are executed in addition to the normal response process: the first response process, which forcibly continues reproduction of the voice output; the second response process, which pauses the voice output; the third response process, which pauses the voice output and processes the interrupt voice input as a response to that voice output; the fourth response process, which processes the interrupt voice input as a response to the first voice output that ended immediately before the second voice output; the fifth response process, which resumes reproduction of the paused voice output; the sixth response process, which reproduces the paused voice output again from the beginning; and the seventh response process, which reproduces the first voice output that was output immediately before again from the beginning. Every situation in which an interrupt voice input occurs can therefore be handled appropriately. As a result, the dialogue can continue without confusion and the necessary items can be input properly.
[0011]
DETAILED DESCRIPTION OF THE INVENTION
An outline of a voice dialogue system 200 including the voice dialogue control apparatus 100 according to the present invention will be described with reference to FIG. An input voice processing device 101 is provided on the input side of the voice dialogue control device 100, and a voice output device 102 is provided on the output side. In the input voice processing device 101, the speech section detection unit 11 detects the start and end of the voice signal input from the user. When the start of an input voice signal is detected, a start detection signal is sent to the voice dialogue control device 100, where it is used to determine that an interrupt voice input has occurred if the voice dialogue system 200 is outputting voice at that moment.
The voice data delimited by the detected start and end of the input voice signal is input to the voice recognition unit 12, which recognizes each delimited segment and outputs the recognition result as character data, such as a word string.
[0012]
The speech recognition result output from the speech recognition unit 12 is passed to the speech understanding unit 13, which identifies the intention and meaning that the user was trying to convey by the utterance. Below, this content of the user's utterance is called the "utterance meaning expression". The utterance meaning expression identified by the speech understanding unit 13 is input to the context understanding unit 14. The context understanding unit 14 calculates the latest state of the system's understanding of what the user has input, using both the "utterance meaning expression" and the "history information" stored in the context understanding unit 14. The latest understanding state determined by the context understanding unit 14 is input to the voice dialogue control device 100 as the output of the input voice processing device 101.
The voice dialogue control device 100 determines the data to be output next as voice by the voice dialogue system 200, selecting it from the dialogue scenarios based on the latest understanding state passed from the input voice processing device 101.
Data that is selected from the dialogue scenario and determined as voice output is output in a character data format, and this character data is input to the voice output device 102. The voice output device 102 includes a voice synthesizer 15 and a voice output unit 16. The voice synthesizer 15 synthesizes voice data to be output from the input character string. The voice output unit 16 controls the start and stop of the voice data synthesized by the voice synthesis unit 15. That is, the pause and output restart of the voice are controlled in accordance with the pause designation (first determination process) and the output restart command (second / third determination process) output from the spoken dialogue control apparatus 100.
[0013]
FIG. 2 shows an internal embodiment of the voice interaction control device 100 according to the present invention. The normal dialogue control unit 103 surrounded by a dotted line shown in FIG. 2 has a configuration equivalent to that of a conventional voice dialogue control device. That is, the normal dialogue control unit 103 includes the latest understanding state receiving unit 21, the start / end signal receiving unit 22, the dialogue scenario storage unit 23, the dialogue scenario selection unit 24, the character string output unit 25, and the normal determination unit 26. And a normal response means 27 and an input item recording means 28.
On the other hand, reference numeral 104 in FIG. 2 denotes the interrupt processing unit, which is the main part of the present invention. The interrupt processing unit 104 comprises a first determination means 30-1, a second determination means 30-2, a third determination means 30-3, a first response means 31-1, a second response means 31-2, a third response means 31-3, a fourth response means 31-4, a fifth response means 31-5, a sixth response means 31-6, a seventh response means 31-7, an utterance time measuring means 32, and a voice stop / restart control means 33. The normal dialogue control unit 103 and the interrupt processing unit 104 are connected to the CPU 105 by the bus line BUS and are controlled as a whole by the CPU 105.
When the latest understanding state regarding the user's voice input, output from the context understanding unit 14, is input to the latest understanding state receiving unit 21, the normal dialogue control unit 103 determines the utterance content to be output next according to that state: the dialogue scenario selection unit 24 selects a dialogue scenario from the dialogue scenario storage unit 23, and the character string output unit 25 outputs the selected scenario as a character string. The voice synthesizer 15 (FIG. 1) synthesizes voice from the character string, and the voice output unit 16 outputs it as voice.
The start / end signal receiving means 22 receives the speech input utterance start and end sent from the speech section detection unit 11 as time data, and also records the speech output utterance start and end output by the system as time data. In addition, the order of utterances and occurrences of interrupts in the dialogue are monitored.
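The time bookkeeping described above can be illustrated with a minimal Python sketch. It is not taken from the specification: the class name `UtteranceLog`, its methods, and the use of seconds as the time unit are all assumptions made for illustration. The sketch records the start and end of each utterance and reports which system output, if any, is in progress at a given time, which is the information needed to notice an interrupt.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class UtteranceLog:
    """Hypothetical record of utterance start/end times, in seconds."""
    # Each entry is (speaker, label, start, end); end is None while ongoing.
    events: List[Tuple[str, str, float, Optional[float]]] = field(default_factory=list)

    def start(self, speaker: str, label: str, t: float) -> None:
        """Record the start of an utterance by 'SYS' or 'USR'."""
        self.events.append((speaker, label, t, None))

    def end(self, label: str, t: float) -> None:
        """Record the end of the ongoing utterance with the given label."""
        for i, (spk, lab, s, e) in enumerate(self.events):
            if lab == label and e is None:
                self.events[i] = (spk, lab, s, t)
                return
        raise KeyError(label)

    def system_speaking_at(self, t: float) -> Optional[str]:
        """Return the label of the system output in progress at time t, if any."""
        for spk, lab, s, e in self.events:
            if spk == "SYS" and s <= t and (e is None or t < e):
                return lab
        return None


log = UtteranceLog()
log.start("SYS", "SYS-1", 0.0)
log.end("SYS-1", 2.0)
log.start("USR", "USR-2", 2.5)   # starts after SYS-1 ended: no interrupt
log.end("USR-2", 4.0)
log.start("SYS", "SYS-3", 4.5)
interrupted = log.system_speaking_at(5.0)  # a user utterance starting here overlaps SYS-3
```

In this sketch an interrupt is simply a user start time for which `system_speaking_at` returns a label; how the actual means 22 stores its time data is not specified in the text.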
[0014]
FIG. 3 shows an example of the time relationship between the voice output of the system and the voice input from the user. A name is assigned to each voice output and each user voice input, and the operation of each unit is described below using these names. In FIG. 3, SYS-1 is referred to as the first voice output from the system, SYS-3 as the second voice output, and SYS-5 as the third voice output. USR-2 and USR-4 are referred to as the first voice input and the second voice input, which the user inputs by voice.
The first voice input USR-2, which responds to the first voice output SYS-1, starts at time U2start, some time after the time S1end at which the first voice output SYS-1 ends. A first voice input USR-2 with this time relationship is processed by the normal dialogue control unit 103.
[0015]
That is, when the start of a user utterance (U2start in this example) is detected, the normal determination means 26 determines whether the system is outputting voice. If the user's utterance starts while the system is not outputting voice, the normal response means 27 is activated and executes the normal response process. Here, the normal response process interprets the voice input currently being input (the first voice input USR-2 in this example) as a response to the voice output that has just been uttered (the first voice output SYS-1 in this example), extracts an input item from the content of the first voice input USR-2, and stores it in the input item recording means 28. While a normal dialogue is in progress, this normal response process is repeated, and the necessary input items are extracted from the voice inputs and taken into the input item recording means 28.
[0016]
FIG. 4 is a flowchart showing the normal response processing operation.
Step SP0 is a step for waiting for voice input from the user.
Step SP1 shows a situation in which a voice input is generated from the user and the starting edge is detected.
Step SP2 is a determination step for determining whether or not sound is being output from the system at the start point, and this determination step is referred to as a normal determination process in this specification. If voice input from the user is generated in the state where the system is not outputting sound in the normal determination process, it is determined as the normal state, and the process proceeds to the normal response process step SP3.
If it is determined in step SP2 that an interrupt utterance has occurred from the user during the voice output of the system, the process branches to an interrupt process. This interrupt process is executed by the interrupt processing unit 104.
If the normal state is determined in step SP2, step SP3 executes the normal response process, treating the input voice (USR-2) as a response to the voice output (SYS-1) that was output immediately before it.
In step SP4, it is determined whether or not all input items have been input. If there are items that have not yet been input, the input items are updated in step SP5, and the process returns to step SP0 and enters a standby state.
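The flow of steps SP2 to SP5 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name `normal_step`, the dictionary of input items, and the item names "area" and "genre" with their values are hypothetical and are not defined in the specification.

```python
from typing import Dict, Optional


def normal_step(system_is_speaking: bool,
                last_prompt_item: str,
                recognized_value: str,
                items: Dict[str, Optional[str]]) -> str:
    """One pass through steps SP2 to SP5; returns the name of the next action."""
    if system_is_speaking:
        # SP2: the input overlaps a system output, so branch to interrupt processing.
        return "interrupt_processing"
    # SP3: normal response, the input answers the output that ended just before it.
    items[last_prompt_item] = recognized_value
    # SP4: check whether every input item has been filled.
    if all(value is not None for value in items.values()):
        return "done"
    # SP5: items remain, so update and return to the standby state (SP0).
    return "wait_for_input"


items = {"area": None, "genre": None}           # hypothetical input items
a1 = normal_step(False, "area", "Shinjuku", items)
a2 = normal_step(True, "genre", "some genre", items)
```

The second call returns the interrupt branch without recording anything, reflecting that an overlapping input is not handled by the normal response process.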
[0017]
Next, the operation of the interrupt processing unit 104 will be described. When the normal determination means 26 (FIG. 2) detects an interrupt, the first determination means 30-1 in the interrupt processing unit 104 is activated and executes the first determination process. In FIG. 3, the interrupt corresponds to the user uttering the second voice input USR-4 while the second voice output SYS-3 is being output. Hereinafter, the voice input USR-4 is therefore referred to as the interrupt voice input.
When the interrupt voice input USR-4 occurs during the second voice output SYS-3, the first determination means 30-1 determines the importance of the content of the second voice output SYS-3 that the system is outputting. This determination can be made, for example, by checking whether a flag indicating the importance is attached to the dialogue scenario. That is, among the dialogue scenarios stored in the dialogue scenario storage means 23, a scenario whose content is highly important, so that the system output should continue even if the user interrupts, carries a flag indicating that interrupts are not accepted. A scenario whose content is not necessarily important, so that an interrupt from the user should be given priority, carries no flag. If the interrupt voice input USR-4 occurs while a flagged dialogue scenario is being output as the second voice output SYS-3, the present invention ignores the interrupt voice input USR-4 and outputs the second voice output SYS-3 of that scenario to the end. This process is referred to herein as the first response process (forced output / interrupt ignored).
[0018]
On the other hand, if the interrupt voice input USR-4 occurs during voice output of a dialogue scenario that does not carry the flag indicating that user interrupts are not accepted, the present invention respects the interrupt voice input USR-4 and first pauses the second voice output SYS-3. This pause is referred to herein as the second response process (output pause).
In this way, when the importance of the voice output content is high, the first response process is executed, so that highly important content (for example, an inquiry from the system to the user) is reliably conveyed to the user.
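The flag-based first determination can be sketched as follows. The `Scenario` class and the field name `no_interrupt` are hypothetical names chosen for this illustration; the specification only states that a flag attached to the dialogue scenario indicates that interrupts are not accepted.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical dialogue scenario entry."""
    text: str
    no_interrupt: bool = False   # assumed name for the importance flag


def first_determination(scenario: Scenario) -> str:
    """Return 'continue' (first response process) or 'pause' (second response process)."""
    return "continue" if scenario.no_interrupt else "pause"


important = Scenario("an important inquiry to the user", no_interrupt=True)
ordinary = Scenario("an ordinary prompt")
```

Keeping the decision as a per-scenario flag means the interrupt policy can be changed by editing the scenario data rather than the control logic, which matches the description of the flag being attached to the stored scenario.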
[0019]
When the second response process is executed and the second voice output SYS-3 is paused as shown in FIG. 3, the second determination means 30-2 and the utterance time measuring means 32 (FIG. 2) are activated. The second determination means 30-2 measures the start time and duration of the utterance output as voice by the system and, from the measurement, calculates where within the system voice output SYS-3 the user's interrupting utterance started. Based on this position, it determines whether the voice output SYS-3 may be regarded as having been conveyed to the user, or whether, although output of SYS-3 had begun, it should be regarded as substantially not yet conveyed. This determination is referred to herein as the second determination process (transmitted / untransmitted determination). One way to make this determination is to check whether the duration of the utterance output by the system is at least a predetermined time set in advance. The predetermined time may be set as an absolute duration up to the point at which the system utterance was paused, or as a ratio, such as 1/2 or 1/3 or more of the total length of the utterance that was to be output.
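The two ways of setting the predetermined time, as an absolute duration or as a ratio of the total length, can be sketched as follows. The function name, parameter names, and default values are assumptions made for illustration; the specification does not fix concrete thresholds.

```python
from typing import Optional


def second_determination(elapsed: float, total: float,
                         abs_threshold: Optional[float] = None,
                         ratio_threshold: float = 0.5) -> str:
    """Classify a paused system output as 'transmitted' or 'untransmitted'.

    elapsed: seconds from the start of the output to the pause.
    total:   full length in seconds of the output that was to be played.
    If abs_threshold is given it is used; otherwise the ratio is used.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if abs_threshold is not None:
        transmitted = elapsed >= abs_threshold
    else:
        transmitted = elapsed / total >= ratio_threshold
    return "transmitted" if transmitted else "untransmitted"


early = second_determination(0.3, 3.0)                      # paused after 10 %
late = second_determination(2.0, 3.0)                       # paused after about 67 %
absolute = second_determination(1.0, 3.0, abs_threshold=0.8)
```

A ratio adapts to outputs of different lengths, while an absolute threshold may suit short prompts; which is preferable is left open by the text.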
[0020]
FIG. 5 shows a flowchart for explaining the outline of the program for executing the first determination process and the second determination process.
Step SP6 executes the first determination process (forced output / interrupt ignored). In step SP6, the importance of the content of the voice output being reproduced is determined. If step SP6 determines that the importance is high, that is, that the user cannot interrupt, the process branches to steps SP7 and SP8, and the first response process is executed: the voice output SYS-3 is forcibly reproduced to the end.
On the other hand, if step SP6 determines that the importance of the content of the second voice output SYS-3 is low, that is, that the interrupt from the user should be handled, the process branches to step SP9 and proceeds from step SP9 to step SP10.
In step SP10, the voice stop / restart control means 33 (FIG. 2) is activated and controls the voice output unit 16 (FIG. 1) to pause reproduction of the second voice output SYS-3, which completes the second response process. After the second response process, the process proceeds to step SP11.
[0021]
In step SP11, the second determination process (transmitted / untransmitted determination) is executed. The second determination process refers to the time d_SYS3 from the start of output of the second voice output SYS-3 to its pause, measured by the utterance time measuring means 32, and to the measured output stop position S3mid of the second voice output SYS-3, and determines whether the content of the second voice output SYS-3 may be regarded as having been transmitted to the user. This determination is made in steps SP12 and SP14.
In step SP12, if the time d_SYS3 is longer than a preset threshold, or the stop position S3mid lies in the latter half of the second voice output (after half of its total duration), the content of the second voice output SYS-3 is treated as having been transmitted to the user (“transmitted”), as shown in step SP13.
On the other hand, if step SP14 determines that the time d_SYS3 is shorter than the threshold, or that the output stop position S3mid lies in the first half of the second voice output SYS-3, step SP15 treats the content of the second voice output SYS-3 as “untransmitted” to the user.
[0022]
After executing step SP13 or SP15, the process proceeds to step SP16 and waits for the end of the interrupt voice USR-4 to be detected.
When the end of the interrupt voice input USR-4 is detected, the process proceeds to a third determination process (fragment / non-fragment determination) in step SP17 shown in FIG.
The third determination process performed in step SP17 analyzes whether the interrupt voice input USR-4 has content. Combining this result with the “transmitted” / “untransmitted” result determined in steps SP13 and SP15 of the second determination process yields four classifications.
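The fragment / non-fragment judgment can be sketched as follows, using the three cues named earlier in this description (recognition result, recognition likelihood, and duration of the recognized speech). How these cues are combined is not stated in the text; the conjunction used here, the function name, and both threshold values are assumptions made purely for illustration.

```python
from typing import List


def third_determination(words: List[str], likelihood: float, duration: float,
                        min_likelihood: float = 0.5,
                        min_duration: float = 0.3) -> str:
    """Classify an interrupt input as 'non-fragment' or 'fragment'.

    words:      recognized word string (empty if nothing was recognized).
    likelihood: recognition likelihood, assumed here to lie in [0, 1].
    duration:   length of the recognized speech in seconds.
    """
    has_content = (bool(words)
                   and likelihood >= min_likelihood
                   and duration >= min_duration)
    return "non-fragment" if has_content else "fragment"


clear_answer = third_determination(["yes"], 0.9, 0.4)
nothing_recognized = third_determination([], 0.9, 0.4)
false_start = third_determination(["so"], 0.2, 0.1)
```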
[0023]
Step SP18 indicates the step in which the first classification is selected. The first classification step SP18 is selected when the content of the second voice output SYS-3 is “transmitted” to the user and the interrupt voice input USR-4 has sufficient content (amount of speech) and is a “non-fragment utterance”.
If this step SP18 is selected, step SP22 processes the interrupt voice input USR-4 as a response to the second voice output SYS-3. That is, the voice recognition result of the interrupt voice input USR-4 is recorded in the input item recording means 28 (FIG. 2) as a response to the second voice output SYS-3 (step SP29). This is referred to as the third response process. The third response process reflects the view that, when the second voice output SYS-3 from the system has been safely conveyed to the user and the interrupt input USR-4 that interrupted SYS-3 has sufficient content, the system should respond to the interrupt input USR-4.
[0024]
Step SP19 indicates the step in which the second classification is selected. The second classification step SP19 is selected when the second voice output SYS-3 is “untransmitted” to the user and the interrupt voice input USR-4 has sufficient content and is a “non-fragment utterance”. In this classification, the user should not be regarded as having made the interrupt input USR-4 after listening to the second voice output SYS-3 from the system, but as responding to the first voice output SYS-1 that the system output in the preceding stage. In other words, the user responded to the first voice output SYS-1 with the first voice input USR-2 and then uttered the second voice input USR-4 to continue that response. This is a case in which the start time (U4start) of the second voice input USR-4 merely happened to be later than the start time (S3start) of the second voice output SYS-3.
Therefore, when step SP19 is selected, step SP23 processes the interrupt voice input USR-4 as a response to the first voice output SYS-1. In this case, in step SP29, the voice recognition result of the interrupt voice input USR-4 is recorded in the input item recording means 28 (FIG. 2) as a response to the first voice output SYS-1. This process is referred to herein as the fourth response process.
[0025]
Step SP20 indicates the step in which the third classification is selected. The third classification step SP20 is selected when the second voice output SYS-3 is “transmitted” to the user and the interrupt voice input USR-4 contains little speech and is a “fragment utterance”. In this third classification, the user has heard the second voice output SYS-3 from the system, but the interrupt input USR-4 was not uttered because there was something specific to convey; the user merely started to say something and stopped. The system therefore need not respond to the second voice input USR-4 with which the user interrupted, and should resume the second voice output SYS-3 that was paused.
Therefore, when this step SP20 is selected, the process proceeds to step SP24, and a process of resuming the reproduction of the second audio output SYS-3 is executed at step SP24. This process is referred to herein as a fifth response process.
Step SP21 indicates a step in which the fourth classification is selected. This fourth classification step SP21 is selected when the second speech output SYS-3 is “not transmitted” to the user and the amount of speech of the interrupt speech input USR-4 is small and is “fragment speech”.
When this step SP21 is selected, it is further classified into two steps SP25 and SP26.
[0026]
Step SP25 is selected when, in addition to the conditions “SYS-3 is untransmitted” and “USR-4 is a fragment utterance” defined in step SP21, the first voice input USR-2 contains an utterance (USR-2 ≠ 0 in FIG. 6). In a situation that satisfies this condition, the user uttered the first voice input USR-2 in response to the first voice output SYS-1, and because “SYS-3 is untransmitted”, the second voice input USR-4 was not a reply made after listening to the second voice output SYS-3. Rather, the user started the second voice input USR-4 as a continuation of the first voice input, it overlapped with the second voice output SYS-3, and the user broke it off partway, producing a fragment utterance. Both the system and the user are thus regarded as having broken off their utterances SYS-3 and USR-4, so in this situation the system should start the second voice output SYS-3 again as originally planned. For this reason, when step SP25 is selected, the process proceeds to step SP27, where the second voice output SYS-3 is reproduced again from the beginning. This process is referred to herein as the sixth response process.
In the sixth response process, the system outputs the second voice output SYS-3 to the user from the beginning and waits for the user to input a response to the second voice output SYS-3.
[0027]
On the other hand, if the first voice input USR-2 is silent (shown as USR-2 = 0 in FIG. 6), step SP26 determines that no response to the first voice output SYS-1 has been made yet. That is, the user was for some reason late in responding to the first voice output SYS-1 and finally tried to begin the response. The system, however, having received no response to the first voice output SYS-1, had judged the response corresponding to the first voice input USR-2 to be silence and had started the second voice output SYS-3, its next utterance for a silent response, and the second voice input USR-4 from the user happened to overlap with it. In this case too, both the system and the user have broken off their utterances SYS-3 and USR-4, but the USR-4 that the user broke off was the beginning of a response to the first voice output SYS-1. It is therefore desirable for the system to repeat the first voice output SYS-1 in order to prompt a response to it. Accordingly, when step SP26 is selected, step SP28 reproduces the first voice output SYS-1 again from the beginning and has the user input a response to the first voice output SYS-1. This process is referred to herein as the seventh response process.
After the fifth, sixth, or seventh response process is executed, step SP30 waits for the voice input that the user makes in response to the system utterance output by that process: the resumed second voice output SYS-3 (fifth response process), the second voice output SYS-3 replayed from the beginning (sixth response process), or the replayed first voice output SYS-1 (seventh response process).
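The selection among the third to seventh response processes described for steps SP18 to SP28 amounts to a small decision table, which can be sketched as follows. The function name and the boolean parameter names are hypothetical; the mapping itself follows the classifications given in the text.

```python
def select_response(transmitted: bool, fragment: bool, usr2_present: bool) -> int:
    """Return the number of the response process (3 to 7) to execute.

    transmitted:  result of the second determination for SYS-3.
    fragment:     result of the third determination for USR-4.
    usr2_present: whether the first voice input USR-2 contained an utterance.
    """
    if not fragment:
        # SP18 / SP19: USR-4 has content, so it is taken in as a response,
        # to SYS-3 if SYS-3 was transmitted (3), otherwise to SYS-1 (4).
        return 3 if transmitted else 4
    if transmitted:
        # SP20: fragment after SYS-3 was heard, so resume SYS-3 (5).
        return 5
    # SP21: fragment before SYS-3 was heard. Replay SYS-3 from the start (6)
    # if USR-2 answered SYS-1 (SP25), otherwise replay SYS-1 (7, SP26).
    return 6 if usr2_present else 7


table = {
    (t, f, u): select_response(t, f, u)
    for t in (True, False) for f in (True, False) for u in (True, False)
}
```

Note that `usr2_present` only matters in the "untransmitted" and "fragment" case; in the other three classifications the result is the same whether or not USR-2 contained an utterance.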
An example of each response processing operation described above will be described with reference to FIGS.
[0028]
FIG. 7 shows an example of normal response processing interaction. The first voice output SYS-1 outputs a voice saying “Please enter the area you want to find”. After the voice ends, the user replies “Yes, please in Shinjuku” with the first voice input USR-2.
After the answer is completed, the system outputs a confirmation inquiry, “Are you sure you want to go to Shinjuku?”, as the second voice output SYS-3. After the voice output SYS-3 ends, the user answers “Yes” with the second voice input USR-4. The system accepts the voice input USR-4, completes the question about the area posed by SYS-1, moves on to the next question, and outputs the voice output SYS-5, "Please tell me the genre." In this way, normal response processing proceeds by alternating voice output and voice input without any interruption.
[0029]
FIG. 8 shows a dialogue example of the third response process described above. The first voice output SYS-1 and the first voice input USR-2 are processed in the same way as in the normal response process. Here, the second voice input USR-4 occurs as an interrupt during the second voice output SYS-3. Assuming that the interrupt voice input USR-4 starts after the midpoint of the second voice output SYS-3, the second determination means 30-2 determines that the content of the second voice output SYS-3 is “transmitted” to the user. Since the interrupt voice input USR-4 is the utterance “Yes” and is a “non-fragment utterance”, step SP22 shown in FIG. 6 processes the second voice input USR-4 as an answer to the second voice output SYS-3. As a result, after the end of the interrupt voice input USR-4, the third voice output SYS-5 is output to inquire about the next input item.
[0030]
FIG. 9 shows a dialogue example of the fourth response process. This example shows a state in which the interrupt voice input USR-4 occurs immediately after the start of the second voice output SYS-3. Since the interrupt comes immediately after output of the second voice output SYS-3 starts, the content of the second voice output SYS-3 is determined to be “untransmitted” to the user. On the other hand, since the interrupt voice input USR-4 is “So Shinjuku,” and is a “non-fragment utterance”, step SP19 shown in FIG. 6 is selected, and the interrupt voice input USR-4 is processed as an answer to the first voice output SYS-1. As a result, the second voice output SYS-3 is treated as if it had not existed and the interrupt voice input USR-4 is treated as an answer to the first voice output SYS-1, so that for the third voice output SYS-5 a dialogue scenario different from the second voice output SYS-3 that would originally have been output is selected and output to the user as voice.
[0031]
FIG. 10 shows a dialogue example of the fifth response process. This example shows a case in which an interrupt voice input USR-4 whose recognition result lacks sufficient content occurs after the midpoint of the second voice output SYS-3 has passed. Since more than half of the second voice output SYS-3 has been reproduced, it is treated as “transmitted” to the user. Since the interrupt voice input USR-4 is determined to be a “fragment utterance”, step SP20 shown in FIG. 6 is selected. As a result, no response is made to the interrupt voice input USR-4 and, because the second voice output SYS-3 is treated as having been transmitted to the user, reproduction of the voice output SYS-3 is resumed from the position at which the interrupt paused it. In this example, the resume position is set slightly before the pause position, toward the start, so that part of the already reproduced content is repeated and the output is easier to follow.
Reproduction of the second voice output SYS-3 is resumed, and after it ends the user inputs “Yes OK” with the voice input USR-5, which confirms, via SYS-3, the answer to the question asked in the voice output SYS-1.
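Setting the resume position toward the start from the pause position, as in the FIG. 10 example, can be sketched as follows. The function name and the rewind amount of 0.5 seconds are assumptions for illustration; the specification does not state how far back the resume position is moved.

```python
def resume_position(pause_pos: float, rewind: float = 0.5) -> float:
    """Return the playback position (seconds) from which to resume.

    pause_pos: position in the output at which the interrupt paused it.
    rewind:    assumed amount of already reproduced audio to repeat.
    The result never goes before the start of the output.
    """
    return max(0.0, pause_pos - rewind)


resumed = resume_position(2.0)        # repeats the last 0.5 s before the pause
near_start = resume_position(0.2)     # clamped to the beginning
```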
[0032]
FIG. 11 shows a dialogue example of the sixth response process. This example shows a case in which an interrupt voice input USR-4 whose recognition result lacks sufficient content occurs immediately after output of the second voice output SYS-3 starts. In this case, since the interrupt voice input USR-4 starts immediately after the start of the second voice output SYS-3, the content of the second voice output is determined to be “untransmitted” to the user. Furthermore, since the interrupt voice input USR-4 is also a “fragment utterance”, step SP21 shown in FIG. 6 is selected. In addition, since the first voice input USR-2 contains speech here, it is determined that the question asked with the first voice output SYS-1 has already been answered by the first voice input USR-2. Therefore, in this case, the second voice output SYS-3 is reproduced from the beginning, and the user is asked for an answer to the second voice output SYS-3.
[0033]
FIG. 12 shows a dialogue example of the seventh response process. In this example, the user does not speak at the first voice input USR-2 in response to the first voice output SYS-1, so the response is silence, and an interrupt voice input USR-4 whose recognition result lacks sufficient content occurs immediately after the start of the second voice output SYS-3.
In this case, since the second voice output SYS-3 is interrupted immediately after it starts, its content is determined to be “untransmitted” to the user. Since the interrupt voice input USR-4 is also determined to be a “fragment utterance”, step SP21 shown in FIG. 6 is selected. Because step SP21 is selected and, in addition, the first voice input USR-2 gave no answer, the process proceeds to step SP26, where it is determined that the first voice output SYS-1 has not yet been answered. The first voice output SYS-1 is then reproduced again, and the user is asked for an answer to the first voice output SYS-1.
[0034]
The spoken dialogue control method and the spoken dialogue control apparatus described above are realized by installing the spoken dialogue control program according to the present invention in a computer and causing the computer to decode and execute the program. The spoken dialogue control program according to the present invention is written as a computer-readable code string and recorded on a computer-readable recording medium; it is installed in the computer either by being read from the recording medium or through a communication line.
[0035]
[Effect of the invention]
As described above, according to the present invention, even when the user makes an interrupt voice input during a voice output of the system, the circumstances of its occurrence are classified into a plurality of cases, and appropriate dialogue processing and input value acquisition processing are executed for the interrupt in every case. As a result, the dialogue does not become confused, and the necessary items can be input without burdening the user.
[Brief description of the drawings]
FIG. 1 is a block diagram for explaining an example of a preferred voice interaction apparatus to which the present invention is applied.
FIG. 2 is a block diagram for explaining a configuration of a main part of the present invention.
FIG. 3 is a timing chart for explaining the operation of the present invention.
FIG. 4 is a flowchart for explaining a branch portion of the operation of the normal dialogue control unit shown in FIG. 2 and of the interrupt processing unit, which is a main part of the present invention.
FIG. 5 is a flowchart for explaining operations of the first response process and the second response process proposed in the present invention.
FIG. 6 is a flowchart for explaining operations of the third response process to the seventh response process proposed in the present invention.
FIG. 7 is a timing chart for explaining a dialogue example during normal dialogue processing by the normal dialogue control unit shown in FIG. 2.
FIG. 8 is a timing chart for explaining a dialogue example of the third response process proposed in the present invention.
FIG. 9 is a timing chart for explaining a dialogue example of the fourth response process proposed in the present invention.
FIG. 10 is a timing chart for explaining a dialogue example of the fifth response process proposed in the present invention.
FIG. 11 is a timing chart for explaining a dialogue example of the sixth response process proposed in the present invention.
FIG. 12 is a timing chart for explaining a dialogue example of the seventh response process proposed in the present invention.
FIG. 13 is a timing chart for explaining an example of conventional technology.
FIG. 14 is a timing chart for explaining another example of the prior art.
[Explanation of symbols]
100 Spoken dialogue control device
101 Input speech processing device
102 Voice output device
103 Normal dialogue control unit
104 Interrupt processing unit
105 CPU
11 Voice section detection unit
12 Voice recognition unit
13 Voice understanding unit
14 Context understanding unit
15 Speech synthesis unit
16 Audio output unit
21 Latest understanding state receiving means
22 Start / end signal receiving means
23 Dialogue scenario storage means
24 Dialogue scenario selection means
25 Character string output means
26 Normal judgment means
27 Normal response means
28 Input item recording unit
200 Spoken dialogue system
30-1 First determination means
30-2 Second determination means
30-3 Third determination means
31-1 First response means
31-2 Second response means
31-3 Third response means
31-4 Fourth response means
31-5 Fifth response means
31-6 Sixth response means
31-7 Seventh response means
32 Utterance time measuring means
33 Voice stop / restart control means
SYS-1 First audio output
SYS-3 Second audio output
USR-2 First voice input
USR-4 Interrupt voice input

Claims

In a voice dialogue control method in which, each time a first voice output from the system finishes, a response from the user corresponding to the first voice output is input by voice, the voice input is recognized by the system, a normal response process is executed that acquires the voice recognition result as an input value corresponding to the voice output that finished immediately before, and a dialogue operation of generating a second voice output after the normal response process is repeated,
A first determination process for determining, when the first voice output of the system has finished and an interrupt voice input occurs during the second voice output, whether to continue or temporarily stop the second voice output based on voice output stop / non-stop information predetermined in consideration of the importance of the content of the second voice output;
A second determination process for, when the first determination process determines that the second voice output is to be temporarily stopped, measuring the time from the start of the second voice output to the start of the interrupt voice input from the user so as to acquire the interrupt position from the user within the second voice output, determining that the content of the second voice output has not been transmitted to the user if the time of the interrupt position is earlier than a predetermined value set in advance, and determining that the content of the second voice output has been transmitted to the user if the time of the interrupt position is later than the predetermined value;
A third determination process for determining, from the recognition result, the recognition likelihood, the duration of the recognized voice, and the like, whether the interrupt voice input from the user was a non-fragment utterance having sufficient content or a fragment utterance lacking sufficient content;
And, according to the respective determination results of the first determination process, the second determination process, and the third determination process,
A first response process for continuing to output the second voice output while ignoring the interrupt voice input when the first determination process determines that the second voice output is to be continued;
A second response process for temporarily stopping reproduction of the second voice output when the first determination process determines that the second voice output is to be temporarily stopped;
A third response process for capturing the interrupt voice input as a response to the second voice output when the second determination process determines that the content has been transmitted and the third determination process determines that the interrupt voice input is a non-fragment utterance;
A fourth response process for capturing the interrupt voice input as a response to the first voice output when the second determination process determines that the content has not been transmitted and the third determination process determines that the interrupt voice input is a non-fragment utterance;
A fifth response process for resuming reproduction of the second voice output stopped by the second response process when the second determination process determines that the content has been transmitted and the third determination process determines that the interrupt voice input is a fragment utterance;
A sixth response process for outputting the second voice output stopped by the second response process again from the beginning when the second determination process determines that the content has not been transmitted and the third determination process determines that the interrupt voice input is a fragment utterance;
A seventh response process for outputting the first voice output again from the beginning when the second determination process determines that the content has not been transmitted, the third determination process determines that the interrupt voice input is a fragment utterance, and there has been no response from the user to the first voice output;
A voice dialogue control method characterized by executing any of the above response processes.

In a voice dialogue control device in which, each time a first voice output from the system finishes, a response from the user corresponding to that voice output is input by voice, the voice input is recognized by the system, a normal response process is executed that acquires the voice recognition result as an input value corresponding to the voice output that finished immediately before, and a dialogue operation of generating a second voice output after the normal response process is repeated,
First determination means for determining, when the first voice output of the system has finished and an interrupt voice input from the user occurs during the second voice output, whether to continue or temporarily stop the second voice output based on voice output stop / non-stop information predetermined in consideration of the importance of the content of the second voice output;
Second determination means for, when the first determination means determines that the second voice output is to be temporarily stopped, measuring the time from the start of the second voice output to the start of the interrupt voice input from the user so as to acquire the interrupt position from the user within the second voice output, determining that the content of the second voice output has not been transmitted to the user if the time of the interrupt position is earlier than a predetermined value set in advance, and determining that the content of the second voice output has been transmitted to the user if the time of the interrupt position is later than the predetermined value;
Third determination means for determining, from the recognition result, the recognition likelihood, the duration of the recognized voice, and the like, whether the interrupt voice input from the user was a non-fragment utterance having sufficient content or a fragment utterance lacking sufficient content;
And, according to the determination results of the first determination means, the second determination means, and the third determination means,
First response means for continuing to output the second voice output while ignoring the interrupt voice input when the first determination means determines that the second voice output is to be continued;
Second response means for temporarily stopping reproduction of the second voice output when the first determination means determines that the second voice output is to be temporarily stopped;
Third response means for capturing the interrupt voice input as a response to the second voice output when the second determination means determines that the content has been transmitted and the third determination means determines that the interrupt voice input is a non-fragment utterance;
Fourth response means for capturing the interrupt voice input as a response to the first voice output when the second determination means determines that the content has not been transmitted and the third determination means determines that the interrupt voice input is a non-fragment utterance;
Fifth response means for resuming reproduction of the second voice output stopped by the second response means when the second determination means determines that the content has been transmitted and the third determination means determines that the interrupt voice input is a fragment utterance;
Sixth response means for outputting the second voice output stopped by the second response means again from the beginning when the second determination means determines that the content has not been transmitted and the third determination means determines that the interrupt voice input is a fragment utterance;
Seventh response means for outputting the first voice output again from the beginning when the second determination means determines that the content has not been transmitted, the third determination means determines that the interrupt voice input is a fragment utterance, and there is no response from the user to the first voice output;
A voice dialogue control device characterized by comprising the above means.

A voice dialogue control program that is written as a computer-readable code string and causes a computer to execute the voice dialogue control method according to claim 1.