JP2009053581A

JP2009053581A - Speech output device

Info

Publication number: JP2009053581A
Application number: JP2007222206A
Authority: JP
Inventor: Satoshi Watanabe (渡辺 聡)
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2007-08-29
Filing date: 2007-08-29
Publication date: 2009-03-12

Abstract

PROBLEM TO BE SOLVED: To obtain a speech output device that changes the speaking pace to one comfortable for the user by a simple method.

SOLUTION: The speech output device, which changes the speaking speed of output speech, the pause length, or both in response to speech input, comprises: a speech detection section 110 for detecting input of speech; a speech output section 150 for outputting predetermined output speech; and a control section 120 for controlling the speaking speed of the output speech, the pause length, or both. The control section 120 controls the speaking speed, the pause length, or both for the following block, based on the time elapsed from when the speech output section 150 outputs the output speech of one block to when the speech detection section 110 detects the input of speech.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a speech output device that varies the speaking speed of output speech, the pause length, or both.

A device that outputs predetermined speech to give the user spoken guidance, such as a voice guidance system, usually outputs a plurality of speech blocks (words, phrases, sentences, paragraphs, and so on) in succession. However, the pace at which the speech is output (hereinafter, the utterance pace) may not match the pace the user wants, so the user is presented with speech that is hard to follow.

In this regard, the following speech synthesizer has been proposed (Patent Document 1). Its stated purpose is: “In a text-to-speech synthesizer that reads out text consisting of a plurality of sentences sentence by sentence, to let the user control when reading of an intermediate sentence starts, so that synthesis moves on to the next sentence after the user has understood the content.” The proposed device is: “A speech synthesizer comprising: a delimiter information identification unit 3 that identifies predetermined data contained in or added to text data as delimiter information; a synthesis interruption control unit 4 that instructs interruption of the speech synthesis operation in response to an identification signal from the delimiter information identification unit 3; a start information identification unit 8 that identifies predetermined speech information based on the user's utterance as output start information; and a synthesis start control unit 9 that, in response to a resume signal from the start information identification unit 8, instructs the start of the speech synthesis operation for the sentence following the interrupted text data.”

Patent Document 1: JP-A-8-248990 (Abstract)

The speech synthesizer described in Patent Document 1 has a pause/resume function and a function for detecting the user's utterance, and is configured to wait for the user's confirmation utterance after each presented speech block (for example, one sentence) before playing the next block.
With this speech synthesizer, the user can listen at his or her own pace while confirming the content block by block.

However, the technique described in Patent Document 1 focuses mainly on preventing the user from missing content, and leaves problems with regard to user comfort while listening. For example, there are problems (1) and (2) below.

(1) A confirmation utterance is required after every speech block, which burdens the user. For example, when listening to speech such as news (with one sentence per block), the user must utter or decide on a confirmation after every sentence, which is tedious.
(2) Confirmation utterances give indirect control over the utterance, but the utterance pace itself cannot be changed to one that is comfortable for the user.

A speech output device has therefore been desired that can change the utterance pace to one comfortable for the user by a simple method.

A speech output device according to the present invention varies the speaking speed of output speech, the pause length, or both in response to speech input, and comprises: a speech detection unit that detects speech input; a speech output unit that outputs predetermined output speech; and a control unit that controls the speaking speed of the output speech, the pause length, or both. The control unit controls the speaking speed, the pause length, or both for the output speech of the next block, based on the time elapsed from when the speech output unit finishes outputting one block of the output speech until the speech detection unit detects speech input.

With the speech output device according to the present invention, the speaking speed and pause length of the next block are controlled based on the time elapsed from output of a speech block until speech input is detected. The utterance pace can therefore be changed to one comfortable for the user by a simple method: for example, a user who wants faster speech inputs a confirmation utterance immediately after a speech block ends.

Embodiment 1.
FIG. 1 is a functional block diagram of a speech output device 100 according to Embodiment 1 of the present invention.
The speech output device 100 outputs predetermined output speech and varies the speaking speed and pause length according to the timing at which the user inputs a confirmation utterance, so that they can be adjusted to be easy for the user to follow. Here, a “pause” is the silent interval between speech blocks.
The configuration of the speech output device 100 is described below.

The speech output device 100 includes a speech detection unit 110, an utterance pace control unit 120, a variable rule table 130, a speech database (hereinafter, speech DB) 140, and a speech output unit 150.

The speech detection unit 110 receives arbitrary speech input by the user and outputs a speech input start notification to the utterance pace control unit 120.
Specifically, the speech signal obtained through a microphone is A/D converted, and speech detection is then performed by digital signal processing. Any method commonly used for speech detection may be used. For example, the power of the speech signal is calculated frame by frame (e.g., frame length 20 ms, frame period 5 ms), and if frames with power at or above a certain level continue for a predetermined duration (e.g., 3 to 5 frames), the speech section is judged to have started.
Detection accuracy may be improved further by adding detection rules based on, for example, the maximum value of the autocorrelation function, or the mel-cepstrum and its delta.
When the speech detection unit 110 judges that a speech section has started, it immediately outputs the speech input start notification.
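The power-based detection rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation in the patent: the function names, the 16 kHz sample rate, the -30 dB power threshold, and the choice of 4 consecutive frames are all hypothetical. Only the 20 ms frame length, the 5 ms frame period, and the "3 to 5 consecutive frames" criterion come from the text.

```python
import math


def frame_power_db(frame):
    """Mean power of one frame in dB (samples assumed to lie in [-1.0, 1.0])."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_sq + 1e-12)  # small offset avoids log10(0)


def detect_speech_start(samples, sample_rate=16000, frame_ms=20, hop_ms=5,
                        power_threshold_db=-30.0, min_frames=4):
    """Return the start index of the frame at which speech onset is confirmed.

    Onset is confirmed once `min_frames` consecutive frames have power at or
    above `power_threshold_db`. Returns None if no onset is found.
    """
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * hop_ms // 1000
    run = 0  # number of consecutive frames above the threshold so far
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        if frame_power_db(frame) >= power_threshold_db:
            run += 1
            if run >= min_frames:
                return start  # the start notification would be issued here
        else:
            run = 0
    return None
```

Because onset is only confirmed after several frames, the returned position lags the true start of the utterance, which is the frame-processing delay the description mentions later.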

The utterance pace control unit 120 instructs the speech output unit 150 to change or maintain the utterance pace, based on the speech input start notification received from the speech detection unit 110 and on the speech output end notification and speech output start notification received from the speech output unit 150.

The utterance pace control unit 120 includes, as internal components, a timing analysis unit 121 and a speech output instruction unit 122.

The timing analysis unit 121 outputs one of three timing analysis results (UP / KEEP / DOWN) to the speech output instruction unit 122, based on the reception timing of each notification, as described later with reference to FIG. 2.

The speech output instruction unit 122 receives the speech output end notification and the speech output start notification from the speech output unit 150, and receives the timing analysis result from the timing analysis unit 121. It also outputs a speaking speed selection instruction and a pause length adjustment instruction to the speech output unit 150.
The content of these instructions is as in (1) to (3) below. Details are described later with reference to FIG. 3.

(1) When UP is received from the timing analysis unit 121, the speaking speed is increased, the pause length is shortened, or both.
(2) When KEEP is received from the timing analysis unit 121, the speaking speed, the pause length, or both are maintained.
(3) When DOWN is received from the timing analysis unit 121, the speaking speed is decreased, the pause length is lengthened, or both.

The variable rule table 130 stores the variable rules described later with reference to FIG. 3.

The speech database 140 stores, for each speech block, speech data (for example, a WAV file; the same applies below) corresponding to each output speech.
The speech database 140 holds speech data at a plurality of speaking speeds for output speech of the same content. For example, it stores recordings of the same content uttered at several speaking speeds, such as fast, normal, and slow.
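One way to picture this organization is a lookup keyed by speech block and speaking speed level. The sketch below is an assumption for illustration only: the block IDs, the file names, and the use of integer levels -1 / 0 / +1 for slow / normal / fast are hypothetical and do not appear in the source.

```python
# Hypothetical speed levels: -1 = slow, 0 = normal, +1 = fast.
SPEED_NAMES = {-1: "slow", 0: "normal", 1: "fast"}
BLOCK_IDS = ["B1", "B2", "B3"]

# One recording per (block ID, speed level); the file names are made up.
SPEECH_DB = {
    (block, level): f"{block.lower()}_{name}.wav"
    for block in BLOCK_IDS
    for level, name in SPEED_NAMES.items()
}


def select_speech_data(block_id, speed_level):
    """Return the recording of `block_id` at the requested speed level."""
    return SPEECH_DB[(block_id, speed_level)]
```

Storing a separate recording per speed means the output side only selects a file and never has to time-stretch audio at playback.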

The speech output unit 150 receives the speaking speed selection instruction and the pause length adjustment instruction from the speech output instruction unit 122, reads speech data of the appropriate speaking speed from the speech DB 140 accordingly, and outputs it with the appropriate pause length.
It also outputs a speech output start notification when it starts speech output, and a speech output end notification when it finishes, to the timing analysis unit 121 and the speech output instruction unit 122.

The utterance pace control unit 120 can be implemented in hardware, such as a circuit device that realizes its function, or as software executed on a processing unit such as a microcontroller or CPU.
The variable rule table 130 and the speech database 140 can be implemented as the data files they require together with a storage device, such as memory or an HDD (hard disk drive), that stores those files. Any suitable file format may be used.

The utterance pace control unit 120 corresponds to the “control unit” in Embodiment 1.
The storage device constituting the variable rule table 130 corresponds to the “storage unit”.

The components of the speech output device 100 have been described above.
Next, the timing analysis performed by the timing analysis unit 121 is described with reference to FIG. 2. Before that, the basic idea behind the timing analysis is explained.

As a general tendency of backchannel responses (aizuchi) in conversation, a listener who wants the speaker to talk faster often responds immediately after the speaker finishes an utterance, prompting the speaker to move on quickly. Conversely, a listener who wants the speaker to talk more slowly often does the opposite.
The speech output device 100 according to the present invention adopts this behavior.
That is, if a confirmation utterance is input immediately after the speech output device 100 finishes outputting one block, the device judges that the user wants it to speak faster; if it takes a long time before the confirmation utterance is input, the device judges that the user wants it to speak more slowly.

The timing analysis unit 121 outputs UP when it judges that the user wants faster speech, and DOWN when it judges that the user wants slower speech. In intermediate cases it outputs KEEP.
An example of these criteria is described next with reference to FIG. 2.

FIG. 2 shows an example of the criteria the timing analysis unit 121 uses to vary the speaking speed and pause length based on when it receives the speech input start notification. Each reception timing is described in turn below.
The thresholds TH1 and TH2 shown in FIG. 2 satisfy TH1 < TH2.

(a) Received between the speech output end notification and TH1
If the timing analysis unit 121 receives the end notification for one block from the speech output unit 150 and then receives a speech input start notification from the speech detection unit 110 before the threshold TH1 has elapsed, it judges that the user wants faster speech and outputs UP.

(b) Received between TH1 and TH2
If the timing analysis unit 121 receives the end notification for one block from the speech output unit 150 and then receives a speech input start notification from the speech detection unit 110 after TH1 has elapsed but before TH2 has elapsed, it outputs KEEP to maintain the current speaking speed and pause length.

(c) Received after TH2 and before the next speech output start notification
If the timing analysis unit 121 receives a speech input start notification from the speech detection unit 110 after TH2 has elapsed and before it receives the next speech output start notification, it judges that the user wants slower speech and outputs DOWN.

(d) Received between the speech output start notification and TH3
If the timing analysis unit 121 receives the start notification for one block from the speech output unit 150 and then receives a speech input start notification from the speech detection unit 110 before the threshold TH3 has elapsed, it cannot tell whether the user responded late to the previous speech block or responded immediately to the current one.
It therefore outputs KEEP to maintain the current speaking speed and pause length for the time being.

(e) Received between TH3 and the speech output end notification
If the timing analysis unit 121 receives the start notification for one block from the speech output unit 150 and then, after TH3 has elapsed and before it receives the end notification for that block from the speech output unit 150, receives a speech input start notification from the speech detection unit 110, it treats the case as equivalent to (a) and outputs UP.
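Cases (a) to (e) can be condensed into a single decision function, sketched below. The function name, the argument names, the use of seconds as the time unit, and the handling of values that fall exactly on a threshold are assumptions made for this illustration; the source only defines the five intervals and their results.

```python
from enum import Enum
from typing import Optional


class Pace(Enum):
    UP = "UP"
    KEEP = "KEEP"
    DOWN = "DOWN"


def classify_input_timing(t_input: float, t_prev_end: float,
                          t_next_start: Optional[float],
                          th1: float, th2: float, th3: float) -> Pace:
    """Classify one speech input start notification.

    t_input      -- time the speech input start notification was received
    t_prev_end   -- time of the most recent speech output end notification
    t_next_start -- time of the following speech output start notification,
                    or None if the next block has not started yet
    """
    if t_next_start is None or t_input < t_next_start:
        # The input arrived during the pause after the previous block.
        elapsed = t_input - t_prev_end
        if elapsed < th1:
            return Pace.UP      # case (a)
        if elapsed < th2:
            return Pace.KEEP    # case (b)
        return Pace.DOWN        # case (c)
    # The input arrived while the next block was already playing.
    since_start = t_input - t_next_start
    if since_start < th3:
        return Pace.KEEP        # case (d): ambiguous, so keep the settings
    return Pace.UP              # case (e): treated like case (a)
```

For example, with th1 = 0.2, th2 = 0.6 and a block that ended at 10.0, an input at 10.4 falls in case (b) and yields `Pace.KEEP`.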

The values of the thresholds TH1, TH2, and TH3 are determined in advance by analyzing actual dialogue data and the like, in light of the basic idea of timing analysis described above. For example, they are specified in morae (beats) of the output speech, such as TH1 = about 1 mora, TH2 = about 3 morae, and TH3 = about 1 mora.
However, the values of TH1 to TH3 are preferably changed according to the speaking speed and pause length, so they should be determined block by block according to the speaking speed and pause length of the speech block concerned.
In Embodiment 1, TH1 to TH3 are determined according to the speaking speed and held in a threshold table (not shown) in the timing analysis unit 121.
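Since the thresholds are given in morae, one plausible way to build a speed-dependent threshold table is to divide the mora counts by the speaking rate of each speed level. The sketch below assumes this conversion; the morae-per-second figures and the level numbering are purely illustrative values, not taken from the source, which only gives the mora counts.

```python
# Mora-based thresholds from the description (approximate values).
MORA_THRESHOLDS = {"TH1": 1.0, "TH2": 3.0, "TH3": 1.0}

# Hypothetical speaking rates in morae per second for each speed level
# (-1 = slow, 0 = normal, +1 = fast). Illustrative numbers only.
MORAE_PER_SECOND = {-1: 6.0, 0: 8.0, 1: 10.0}


def thresholds_in_seconds(speed_level):
    """Convert the mora-based thresholds to seconds for one speed level."""
    rate = MORAE_PER_SECOND[speed_level]
    return {name: morae / rate for name, morae in MORA_THRESHOLDS.items()}


# Precomputed table, analogous to the threshold table held by the
# timing analysis unit 121.
THRESHOLD_TABLE = {lvl: thresholds_in_seconds(lvl) for lvl in MORAE_PER_SECOND}
```

Under these assumed rates, faster speech gets shorter thresholds in seconds, and the relation TH1 < TH2 holds at every level.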

FIG. 3 shows the structure of the variable rule table 130 and example data.
The variable rule table 130 stores the rules by which the speech output instruction unit 122 issues instructions to vary the speaking speed and pause length, based on the result of the timing analysis performed by the timing analysis unit 121.
In the example data of FIG. 3, the initial values of the pause length and speaking speed are P0 and V0, respectively.
When the speech output instruction unit 122 receives UP, KEEP, or DOWN from the timing analysis unit 121, it issues instructions to vary the speaking speed and pause length as follows, according to the rules in the variable rule table 130.

(1) When UP is received
As the speaking speed selection instruction, +1, meaning one step faster than the current speaking speed, is output to the speech output unit 150.
As the pause length adjustment instruction, +1, meaning one step shorter than the current pause length, is output to the speech output unit 150.
(2) When KEEP is received
Nothing is output.
(3) When DOWN is received
As the speaking speed selection instruction, -1, meaning one step slower than the current speaking speed, is output to the speech output unit 150.
As the pause length adjustment instruction, -1, meaning one step longer than the current pause length, is output to the speech output unit 150.
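The rules above amount to a small lookup table applied to the current settings. A minimal sketch follows, assuming integer step levels relative to V0 and P0 and assuming the levels are clamped to a fixed range; the range limits and all names are hypothetical. The sign convention follows the text: +1 means one step faster for speaking speed and one step shorter for pause length.

```python
# (speed step, pause step) per timing analysis result; None = no instruction.
RULE_TABLE = {
    "UP": (+1, +1),    # one step faster, one step shorter pause
    "KEEP": None,      # nothing is output
    "DOWN": (-1, -1),  # one step slower, one step longer pause
}


def apply_rule(result, speed_level, pause_level, lo=-1, hi=1):
    """Return the new (speed_level, pause_level) after one analysis result.

    Levels are offsets from the initial values V0 and P0. The clamping
    range [lo, hi] is an assumption for this sketch.
    """
    steps = RULE_TABLE[result]
    if steps is None:
        return speed_level, pause_level
    speed_step, pause_step = steps
    new_speed = min(hi, max(lo, speed_level + speed_step))
    new_pause = min(hi, max(lo, pause_level + pause_step))
    return new_speed, new_pause
```

Keeping the rules in a table rather than in code means a different policy can be tried by swapping only the `RULE_TABLE` data.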

FIG. 4 shows an example of the operation of the speech output device 100.
Here, the speech output device 100 is a device with a microphone and a speaker connected, and the user sits in front of the microphone and listens.
The speech output device 100 outputs a voice message (for example, “today's news”) from the speaker one sentence at a time.
The user listens to the voice message (while nodding) and occasionally gives a spoken backchannel response, such as “uh-huh”, “huh”, “I see”, “yes, yes”, “next!”, or “hold on a moment”.
The speech uttered by the user is captured by the speech output device 100 through the microphone, and speech detection is performed.
The operation of the speech output device 100 is described below step by step with reference to FIG. 4.

(1) The timing analysis unit 121 sets the initial values: pause length = P0 and speaking speed = V0.
(2) The speech output unit 150 starts outputting the first speech block (B1), using the speech data in the speech DB 140 that matches the initial speaking speed V0.
(3) If no confirmation utterance from the user is detected between the end of speech block B1 and the elapse of the initial pause length P0, the speech output unit 150 starts outputting the second speech block (B2) as soon as the pause length P0 has elapsed after the end of B1.

(4) Suppose the user utters speech S2 after speech block B2 ends and before the pause length P0 elapses.
(5) On detecting speech S2, the speech detection unit 110 immediately outputs a speech input start notification to the timing analysis unit 121. Because the speech detection unit 110 processes the signal in frames, this output time lags slightly behind the actual start of the utterance, but the delay is ignored here.

(6) The timing analysis unit 121 determines UP, KEEP, or DOWN from the reception time of the speech input start notification and the reception time of the end notification for speech block B2, according to the rules described with reference to FIG. 2. The result is output immediately to the speech output instruction unit 122.
In the example of FIG. 4, the speech input start notification is received in the period from TH1 to TH2, so KEEP is output.

(7) Having received KEEP, the speech output instruction unit 122 issues no instruction to change the speaking speed or pause length, and the previous settings are maintained.
(8) Since the speech output unit 150 receives no instruction from the speech output instruction unit 122 during the pause, it starts outputting the third speech block (B3) at speaking speed V0 as soon as the pause length P0 has elapsed after the end of speech block B2.

(9) Suppose the user utters speech S3 after speech block B3 ends and before the pause length P0 elapses.
(10) On detecting speech S3, the speech detection unit 110 immediately outputs a speech input start notification to the timing analysis unit 121.
(11) The timing analysis unit 121 determines UP, KEEP, or DOWN as in step (6) and outputs the result to the speech output instruction unit 122. Here, the time elapsed from the end of speech block B3 until the speech input start notification is received is less than TH1, so UP is output.

(12) Having received UP, the speech output instruction unit 122 instructs the speech output unit 150 to change the speaking speed V by +1 and the pause length P by +1.
(13) The speech output unit 150 changes the pause length to P0+1 (one step shorter than P0) and the speaking speed to V0+1 (one step faster than V0), and writes them to its internal pause length memory and speaking speed memory (neither shown).
(14) The speech output unit 150 reads the speech data of the fourth speech block (B4) corresponding to speaking speed V0+1 from the speech DB 140, and starts outputting speech block B4 once the pause length P0+1 has elapsed after the end of speech block B3.

(15) Suppose the user utters speech S4 after speech block B4 ends and before the pause length P0+1 elapses.
(16) On detecting speech S4, the speech detection unit 110 immediately outputs a speech input start notification to the timing analysis unit 121.
(17) The timing analysis unit 121 determines UP, KEEP, or DOWN as in step (6) and outputs the result to the speech output instruction unit 122. Here, the time elapsed from the end of speech block B4 until the speech input start notification is received is TH2 or more, so DOWN is output.

(18) Having received DOWN, the speech output instruction unit 122 instructs the speech output unit 150 to change the speaking speed V by -1 and the pause length P by -1.
(19) The speech output unit 150 changes the pause length back to P0 and the speaking speed back to V0, and writes them to its internal pause length memory and speaking speed memory (neither shown).
(20) The speech output unit 150 reads the speech data of the fifth speech block (B5) corresponding to speaking speed V0 from the speech DB 140, and starts outputting speech block B5 once the pause length P0 has elapsed after the end of speech block B4.
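The sequence in steps (1) to (20) can be replayed with a short trace. This sketch makes several simplifying assumptions that are not in the source: a single integer level stands for both the speaking speed step and the pause length step (they move together under the FIG. 3 rules), the thresholds are fixed at hypothetical values in seconds instead of varying per block, and the elapsed times are invented so that they land in the same intervals as in the walkthrough.

```python
STEP = {"UP": +1, "KEEP": 0, "DOWN": -1}


def classify(elapsed, th1, th2):
    """Classify the time from the end of a block to the user's input."""
    if elapsed < th1:
        return "UP"
    if elapsed < th2:
        return "KEEP"
    return "DOWN"


def trace_levels(elapsed_after_block, th1=0.2, th2=0.6):
    """Return the level used for each block, starting with B1 at level 0.

    elapsed_after_block[i] is the delay of the user's input after block
    i+1 ended, or None if the user gave no input during that pause.
    """
    level = 0
    levels = [level]  # level used for the first block
    for elapsed in elapsed_after_block:
        if elapsed is not None:
            level += STEP[classify(elapsed, th1, th2)]
        levels.append(level)
    return levels


# After B1: no input. After B2: 0.4 s (KEEP). After B3: 0.1 s (UP).
# After B4: 0.8 s (DOWN).
levels = trace_levels([None, 0.4, 0.1, 0.8])
print(levels)  # [0, 0, 0, 1, 0] -> B1..B3 at V0/P0, B4 at V0+1/P0+1, B5 at V0/P0
```

The printed levels match the walkthrough: blocks B1 to B3 use the initial settings, B4 is one step faster with a shorter pause, and B5 returns to the initial settings.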

As described above, according to Embodiment 1, the utterance pace control unit 120 controls the speaking speed and pause length of the output speech based on the time elapsed from when the speech output unit 150 outputs a speech block until the speech detection unit 110 detects the user's speech input. The user can therefore adjust the speech output to a pace that suits him or her simply by giving brief backchannel responses.

In addition, the utterance pace control unit 120 determines the thresholds TH1 to TH3 block by block according to the speaking speed and pause length of each speech block, so the optimal thresholds can be set for the utterance pace of each block and the device can respond to the user in a fine-grained way.
For example, when the pause length is set long, the user has ample time to input a confirmation utterance, so the thresholds TH1 to TH3 should also be set longer. Setting the thresholds according to the utterance pace of each speech block, as described above, makes it possible to use settings that suit the actual characteristics of the block.

Embodiment 2.
Embodiment 1 described how the variable rules for pause length and speaking speed held in the variable rule table 130 determine which speech data stored in the speech DB 140 is used and what pause length is applied.
Embodiment 2 of the present invention describes changing the contents of the variable rule table 130 so that the speaking speed and pause length are varied differently from Embodiment 1.
The rest of the configuration is the same as in Embodiment 1, and its description is omitted.

FIG. 5 shows other configuration examples of the variable rule table 130.
FIG. 3, described in Embodiment 1, showed an example of a variable rule table in which the pause length P and the speaking speed V increase or decrease together with each UP or DOWN judgment. The variable rules are not limited to this, and can be set to suit the usage environment of the speech output device 100 and the contents of the speech DB 140.

FIG. 5(a) is an example rule that gives priority to adjusting the pause length and switches to adjusting the speech speed once the pause length adjustment has saturated.
FIG. 5(b) is the opposite of FIG. 5(a): a rule that gives priority to adjusting the speech speed.
FIG. 5(c) is a rule that interleaves adjustments of the speech speed and the pause length.

These rules can be changed by replacing the variable rule data stored in the variable rule table 130.

Making the variable rule table 130 replaceable in this way removes the need to build a separate configuration for each usage situation of the audio output device 100; swapping the data file is enough. This makes the configuration more flexible and allows the audio output device 100 to be adapted easily to a wider range of applications and environments.
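One way to picture replaceable rule data is an ordered list of (pause length, speech speed) steps that an "UP" or "DOWN" judgment walks through. The sketch below is an assumption made for illustration: the actual contents of FIG. 5 are not reproduced in the text, so the step values, the list layout, and the function name `apply_judgment` are all hypothetical.

```python
# Each rule is an ordered list of (pause_seconds, speed_ratio) steps,
# from the slowest pace (index 0) to the fastest. All values are hypothetical.

# Like FIG. 5(a): shorten the pause first, then raise the speed once it saturates.
PAUSE_FIRST = [(2.0, 1.0), (1.5, 1.0), (1.0, 1.0), (1.0, 1.2), (1.0, 1.4)]

# Like FIG. 5(b): raise the speed first, then shorten the pause.
SPEED_FIRST = [(2.0, 1.0), (2.0, 1.2), (2.0, 1.4), (1.5, 1.4), (1.0, 1.4)]

# Like FIG. 5(c): alternate between pause and speed adjustments.
MIXED = [(2.0, 1.0), (1.5, 1.0), (1.5, 1.2), (1.0, 1.2), (1.0, 1.4)]


def apply_judgment(rule, index, judgment):
    """Return the new step index after an "UP", "DOWN", or "KEEP" judgment.

    The index saturates at both ends of the rule list.
    """
    if judgment == "UP":
        return min(index + 1, len(rule) - 1)
    if judgment == "DOWN":
        return max(index - 1, 0)
    return index  # "KEEP" and anything else leave the pace unchanged
```

Under this layout, switching behavior means passing a different list to `apply_judgment`; the control logic itself does not change, which mirrors the point that only the data file needs to be swapped.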

Embodiment 3.
In Embodiments 1 and 2, the timing analysis unit 121 makes the judgment described with reference to FIG. 2 based on the time at which it receives the voice input start notification (and the voice output end notification). Frame processing, however, introduces a certain delay into these reception times.
Embodiment 3 of the present invention therefore describes a technique for reducing the delay caused by frame processing.

When the voice detection unit 110 outputs a voice input start notification to the timing analysis unit 121, it embeds the utterance start time itself in the notification message. Likewise, when the voice output unit 150 outputs a voice output end notification to the timing analysis unit 121, it embeds the voice output end time itself.
Instead of using the reception times of these notifications, the timing analysis unit 121 reads the times contained in them and makes the judgment described with reference to FIG. 2 based on those values.
Basing the judgment on the time information carried in each notification, rather than on its reception time, reduces the effect of the frame processing delay.
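A minimal sketch of a notification that carries its own event time is shown below. The `Notification` class, its field names, and `elapsed_time` are hypothetical and chosen only to illustrate the idea; the specification does not define a message format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    kind: str          # "output_end" or "input_start" (assumed labels)
    event_time: float  # time the event occurred, stamped by the sender, in seconds


def elapsed_time(output_end: Notification, input_start: Notification) -> float:
    """Elapsed time from the end of voice output to the start of voice input.

    Uses the times embedded in the notifications, not the times at which
    the timing analysis side happened to receive them.
    """
    return input_start.event_time - output_end.event_time
```

For example, if output ended at 10.0 s and the utterance started at 10.5 s, the elapsed time is 0.5 s no matter how late the input start notification arrives.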

FIG. 6 illustrates the procedure by which the timing analysis unit 121 obtains the utterance start time of the user's confirmation voice. Each step in the figure is outlined below.

(1) The user starts inputting the confirmation voice, and the voice detection unit 110 starts detecting the voice signal.

(2) The voice detection unit 110 does not output a voice input start notification the instant the user starts speaking. As described in Embodiment 1, it judges that a voice section has started when, for example, frames with power at or above a certain level continue for a predetermined time (e.g., 3 to 5 frames). To make this judgment, the voice detection unit 110 accumulates frames of the voice signal, calculating the power values in parallel with the accumulation.

(3) When step (2) leads to the judgment that a voice section has started, voice detection of the confirmation voice is complete.
(4) As soon as voice detection of the confirmation voice is complete, the voice detection unit 110 outputs a voice input start notification to the timing analysis unit 121 (although some time lag remains).
(5) The user finishes inputting the confirmation voice.
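The detection in steps (2) and (3) can be sketched as a run-length check on frame power. This is an illustrative reading of the text, not the device's actual algorithm: the mean-square power measure, the function names, and the default of 3 consecutive frames are assumptions. Frames are assumed to be non-empty sequences of samples.

```python
def frame_power(frame):
    """Mean-square power of one frame (assumes a non-empty sequence of samples)."""
    return sum(s * s for s in frame) / len(frame)


def detect_onset(frames, power_threshold, min_frames=3):
    """Return the index of the frame at which detection completes, or None.

    Detection completes once `min_frames` consecutive frames have power at or
    above `power_threshold`. The utterance itself began `min_frames - 1`
    frames earlier, which is the source of the time lag described above.
    """
    run = 0
    for i, frame in enumerate(frames):
        if frame_power(frame) >= power_threshold:
            run += 1
            if run >= min_frames:
                return i
        else:
            run = 0  # a quiet frame breaks the run
    return None
```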

As FIG. 6 shows, a time lag due to frame processing exists between the moment the user starts inputting the confirmation voice and the moment the timing analysis unit 121 receives the voice input start notification. Allowing for this time lag, the following methods yield a more accurate utterance start time.

(Method 1) Include the voice detection completion time.
When the voice detection unit 110 outputs the voice input start notification to the timing analysis unit 121, it includes the time at which voice detection completed in the notification as the utterance start time.
Even though there is a time lag between the user starting to input the confirmation voice and the timing analysis unit 121 receiving the notification, using the detection completion time carried in the notification gives a value closer to the actual voice input start time than the reception time does.

(Method 2) Allow for the frame accumulation time required for voice detection.
In step (2) of FIG. 6, the voice detection unit 110 needs a certain frame accumulation time before it can judge that a voice section has started. This frame accumulation time is subtracted from the voice detection completion time of Method 1, and the voice detection unit 110 includes the resulting time in the voice input start notification as the utterance start time.
This brings the reported time still closer to the actual voice input start time.

(Method 3) Subtract the time required for computation.
In addition to Method 2, the voice detection unit 110 subtracts the computation time it needs for the power calculation of the voice signal and similar processing, and includes the resulting time in the voice input start notification as the utterance start time. Because the computation time is likely to be small compared with the other delays, this method may be omitted.
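Methods 1 to 3 amount to successive subtractions from the detection completion time. A minimal sketch follows; the function name, the parameter names, and the 20 ms frame duration in the example are assumptions, not values given in the specification.

```python
def estimated_utterance_start(detection_done_time: float,
                              frame_duration: float,
                              frames_needed: int,
                              compute_time: float = 0.0) -> float:
    """Estimate when the user actually started speaking, in seconds.

    Method 1: start from the time at which voice detection completed.
    Method 2: subtract the frame accumulation time needed for the judgment.
    Method 3: optionally subtract the computation time (default 0, since the
              text notes it is small enough to omit).
    """
    accumulation_time = frame_duration * frames_needed
    return detection_done_time - accumulation_time - compute_time
```

For instance, with detection completing at 10.0 s, an assumed frame duration of 0.02 s, and 3 frames needed, the estimate is about 9.94 s.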

(Method 4) Back-calculate from the sampling frequency and the sample sequence number.
In Methods 1 to 3, when the confirmation voice is recorded digitally, times can also be back-calculated from the sampling frequency and the sample sequence numbers.
In this case, the voice detection unit 110 samples the user's confirmation voice as digital voice data when accumulating frames, and assigns a serial number to each sample.
For example, when subtracting the frame accumulation time in Method 2, the time to subtract can be calculated, instead of being measured directly, by dividing the number of samples back to the target sample number by the sampling frequency (that is, multiplying it by the sampling period).
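The back-calculation in Method 4 can be sketched as below. A count of samples is converted to a duration by dividing it by the sampling frequency, which is the same as multiplying it by the sampling period. The function name and the 16 kHz rate in the example are assumptions.

```python
def time_to_subtract(current_sample: int,
                     target_sample: int,
                     sampling_frequency_hz: float) -> float:
    """Seconds between two numbered samples of the recorded confirmation voice.

    `current_sample` is the serial number at which detection completed and
    `target_sample` is the earlier serial number to return to.
    """
    samples_back = current_sample - target_sample
    return samples_back / sampling_frequency_hz
```

At an assumed 16,000 Hz, going back 480 samples corresponds to 0.03 s, so the utterance start time would be the detection completion time minus 0.03 s.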

Methods 1 to 4 above were described for the voice input start notification, but applying the same methods to the other notifications likewise yields accurate times.

Whichever of Methods 1 to 4 is used, the timing analysis unit 121 obtains, from the utterance start time included in the voice input start notification, a time as close as possible to the moment the user actually started inputting the confirmation voice.
The judgment based on the thresholds TH1 to TH3 described with reference to FIGS. 2 to 4 therefore becomes more accurate, which reduces the likelihood of a threshold judgment the user did not intend and makes the device more convenient for the user.

Although the above describes including the already-subtracted time in each notification, the subtraction may instead be performed after each notification is received.

Embodiment 4.
FIG. 7 is a functional block diagram of the audio output device 100 according to Embodiment 4 of the present invention.
The audio output device 100 according to Embodiment 4 includes an utterance text 160 and a speech synthesis unit 170 in place of the voice DB 140 described with reference to FIG. 1 of Embodiment 1. The other components are the same as in Embodiments 1 to 3, so their description is omitted.

In Embodiment 1, voice data (wav files) matching the utterance content was stored in the voice DB 140 in advance for each speech speed, then read out and output as voice.
In Embodiment 4, only the text of the utterance content is stored, as the utterance text 160, and the output voice is generated dynamically by speech synthesis. To vary the speech speed, the speech speed is passed as a parameter at synthesis time.

The utterance text 160 can be implemented by storing a data file, such as a text file, in a storage device such as memory or an HDD. The utterance text 160 is divided into voice blocks by delimiters or by splitting it into separate files.
The speech synthesis unit 170 performs speech synthesis based on the contents of the utterance text 160 and can use the technology generally known as TTS (text-to-speech). It receives the speech speed parameter described above as an external parameter and synthesizes speech so that the output voice has the specified speech speed.

Next, the operation of the audio output device 100 according to Embodiment 4 is described.
The operation of the audio output device 100 in Embodiment 4 is largely the same as that described in Embodiments 1 to 3. The following brief description focuses on the differences.

The voice output unit 150 requests the speech synthesis unit 170 to synthesize speech at the speech speed instructed by the voice output instruction unit 122.
The speech synthesis unit 170 first reads text data corresponding to one sentence from the utterance text 160, synthesizes it at the normal speech speed, and has the voice output unit 150 output it.
When the voice detection unit 110 then detects the user's confirmation voice during the pause section, the voice output instruction unit 122 instructs the voice output unit 150 on the speech speed and pause length based on the analysis result of the timing analysis unit 121.
The speech synthesis unit 170 synthesizes speech at the speech speed passed on by the voice output unit 150 and returns the result to the voice output unit 150, which outputs the synthesized speech it receives.

As described above, according to Embodiment 4, preparing a single utterance text 160 is enough to output voice at any speech speed dynamically through speech synthesis. There is no need to prepare voice data (wav files) for every speech speed, which reduces the setup effort.

In addition, voice data generally becomes very large, whereas the configuration of Embodiment 4 needs to store only text data. The data size is therefore small, which helps reduce the size and cost of the audio output device 100.

Embodiment 5.
Embodiments 1 to 4 described how the user inputs an arbitrary confirmation voice and how the speech speed and pause length are varied when the timing analysis unit 121 analyzes the timing at which the voice detection unit 110 detects that voice, using criteria such as those in FIG. 2.
Embodiment 5 of the present invention describes a configuration that restricts the confirmation voice input by the user to specific words or phrases.

FIG. 8 is a functional block diagram of the audio output device 100 according to Embodiment 5.
In FIG. 8, a voice recognition unit 180 is newly provided between the voice detection unit 110 and the speech pace control unit 120. The other components are the same as in FIG. 1 described in Embodiment 1, so their description is omitted. The configurations of the other embodiments may also be used.

The voice recognition unit 180 receives the confirmation voice input by the user from the voice detection unit 110 and performs voice recognition on it. If the recognition result indicates that a predetermined reserved word was input, it outputs a voice input start notification to the timing analysis unit 121.
The reserved word here is preferably one that is unmistakably a confirmation voice, and a simple word such as "hai" ("yes") or "un" ("uh-huh") suffices.

The operation of the audio output device 100 in Embodiment 5 is briefly described below.
When the user inputs a confirmation voice, the voice detection unit 110 detects it and passes it to the voice recognition unit 180.
The voice recognition unit 180 performs voice recognition. If the input is a predetermined reserved word, it outputs a voice input start notification to the timing analysis unit 121; otherwise it outputs nothing.
The rest of the operation is the same as in the other embodiments, so its description is omitted.
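The reserved-word gate can be sketched as a simple membership test on the recognizer's text output. The recognition step itself is outside the sketch; the function name `should_notify` and the use of a plain string result are assumptions. The two reserved words are the examples given in the specification.

```python
# Example reserved words from the specification: "hai" (yes) and "un" (uh-huh).
RESERVED_WORDS = {"はい", "うん"}


def should_notify(recognized_text: str) -> bool:
    """Return True only if the recognized text is a reserved word.

    When this returns False, no voice input start notification is sent,
    so background sounds and unrelated speech do not change the pace.
    """
    return recognized_text.strip() in RESERVED_WORDS
```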

As described above, according to Embodiment 5, sounds other than the user's confirmation voice, such as background noise, are no longer mistakenly detected as the confirmation voice. This greatly reduces the chance that the speech pace changes without the user intending it, which makes the device more convenient for the user.

Embodiment 6.
Embodiments 1 to 5 described controlling the speech speed and pause length according to the timing at which the user inputs the confirmation voice. Instead of a confirmation voice, the input may be made with another device, such as a mouse.

Embodiment 7.
Embodiment 1 described storing voice data for each speech speed in the voice DB 140. Alternatively, voice data for a single speech speed (for example, the standard speech speed) may be stored, and a speech speed conversion device may be interposed to convert the speech speed at output time.
Embodiment 4 described specifying the speech speed as an external parameter of the speech synthesis unit 170. Here too, a speech speed conversion device may be interposed to convert the speech speed instead.
The other embodiments may likewise be configured to convert the speech speed with an interposed speech speed conversion device.

Embodiment 8.
The description of FIG. 2(d) in Embodiment 1 explained that, because it cannot be determined whether the user responded late to the preceding voice block or immediately to the current one, the judgment is set to "KEEP" for the time being.
Embodiment 8 of the present invention defines "WAIT" as a new, more proactive action and performs it in response to a user response in that section.
The components and basic operation are the same as in the embodiments above, so their description is omitted.

In Embodiment 8, the timing analysis unit 121 outputs "WAIT" when it receives a voice input start notification in the section of FIG. 2(d). Like "KEEP", "WAIT" leaves the speech speed and pause length unchanged; unlike "KEEP", the fact that "WAIT" was output is stored internally.
The significance of storing this fact, and the specific operation, are explained next from the user's point of view.

When the user's confirmation voice is detected in the section of FIG. 2(d), the user presumably meant to input it in the section of FIG. 2(c) or (e). Because the timing analysis unit 121 responds with "WAIT", the speech speed and pause length do not change.
The user, however, input the confirmation voice intending to change the speech speed or pause length, so the audio output device 100 has not behaved as the user intended.
In this case, the user will presumably fine-tune the input timing on his or her own for the next confirmation voice so that it falls in the section of FIG. 2(c) or (e).

As a result of this fine-tuning, the next confirmation voice is input in the section of FIG. 2(c) or (e). If "UP" or "DOWN" were output only once at this point, the user would perceive one operation as having been lost, which degrades the user experience.
The device therefore uses the internally stored fact that "WAIT" was output last time and executes "UP" or "DOWN" twice, covering both the previous and the current operation. Specifically, "UP" or "DOWN" may be output twice in succession, or the voice output instruction unit 122 may be instructed to double the control amount in a single operation.
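The "WAIT" behavior can be sketched as a one-bit memory in the timing analysis logic. The class and method names below are hypothetical. The sketch takes the "output twice in succession" option, and it clears the stored flag on any non-"WAIT" judgment, which is an assumption: the text does not say what happens when "KEEP" follows "WAIT".

```python
class WaitAwareAnalyzer:
    """Remembers a previous "WAIT" and doubles the next "UP" or "DOWN"."""

    def __init__(self):
        self.pending_wait = False  # True if the previous judgment was "WAIT"

    def decide(self, judgment: str) -> list:
        """Return the pace changes to pass on for this judgment.

        "WAIT" and "KEEP" produce no pace change. "UP" or "DOWN" is emitted
        twice if the previous judgment was "WAIT", otherwise once.
        """
        if judgment == "WAIT":
            self.pending_wait = True
            return []
        repeat = 2 if self.pending_wait else 1
        self.pending_wait = False  # assumption: any other judgment clears the flag
        if judgment in ("UP", "DOWN"):
            return [judgment] * repeat
        return []  # "KEEP"
```

The alternative mentioned in the text, doubling the control amount in one operation, would return a single action with a multiplier instead of a repeated action.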

As described above, Embodiment 8 defines "WAIT" as a new action, stores internally the fact that "WAIT" was output last time, and at the next operation also performs the operation that was held over. Even when the confirmation voice is input at the timing of FIG. 2(d), this provides an easy-to-use audio output device without degrading the user experience from the next input onward.

Embodiment 9.
FIG. 9 is a configuration diagram of a voice guidance system according to Embodiment 9 of the present invention.
In FIG. 9, the audio output device 100 and a user terminal 200 are connected via a network 300.

The audio output device 100 has the configuration described in Embodiments 1 to 8, and is additionally provided with an interface for connecting to the network 300 as appropriate.
The user terminal 200 is a computer equipped with a microphone and a speaker.
The network 300 is a wired or wireless communication line.
The operation of the voice guidance system of FIG. 9 is briefly described below.

(1) The user uses the user terminal 200 to request voice guidance from the audio output device 100 via the network 300.
(2) The audio output device 100 receives the request from the user terminal 200 and outputs voice from the voice output unit 150. The output voice is transmitted as voice data to the user terminal 200 via the network 300.
(3) The user inputs a confirmation voice into the microphone of the user terminal 200. The confirmation voice is transmitted as voice data to the audio output device 100 via the network 300.
(4) The voice detection unit 110 receives the voice data of the confirmation voice from the network 300. The subsequent operation is the same as described in each embodiment.

In Embodiment 9 the user terminal 200 is a computer, but it may be any other terminal capable of voice input and output, such as a mobile phone. The network 300 is a communication network appropriate to the type of terminal.

As described above, according to Embodiment 9, voice guidance from the audio output device 100 can be provided via the network 300, so the audio output device 100 can be used in a variety of ways.

FIG. 1 is a functional block diagram of the audio output device 100 according to Embodiment 1.
FIG. 2 is an example of the criteria by which the timing analysis unit 121 varies the speech speed and pause length based on the timing at which it receives a voice input start notification.
FIG. 3 shows the configuration of the variable rule table 130 and example data.
FIG. 4 shows an example of the operation of the audio output device 100.
FIG. 5 shows other configuration examples of the variable rule table 130.
FIG. 6 illustrates the procedure by which the timing analysis unit 121 obtains the utterance start time of the user's confirmation voice.
FIG. 7 is a functional block diagram of the audio output device 100 according to Embodiment 4.
FIG. 8 is a functional block diagram of the audio output device 100 according to Embodiment 5.
FIG. 9 is a configuration diagram of the voice guidance system according to Embodiment 9.

Explanation of symbols

100 Audio output device, 110 Voice detection unit, 120 Speech pace control unit, 130 Variable rule table, 140 Voice database, 150 Voice output unit, 160 Utterance text, 170 Speech synthesis unit, 180 Voice recognition unit, 200 User terminal, 300 Network.

Claims

1. An audio output device that varies the speech speed or the pause length of output voice, or both, in response to voice input, comprising:
a voice detection unit that detects voice input;
a voice output unit that outputs predetermined output voice; and
a control unit that controls the speech speed or the pause length of the output voice, or both,
wherein the control unit controls the speech speed or the pause length, or both, of the output voice of the next block based on the elapsed time from when the voice output unit outputs the output voice of one block until the voice detection unit detects voice input.

2. The audio output device according to claim 1, wherein the control unit:
when the elapsed time is less than a predetermined first threshold, increases the speech speed of the output voice of the next block, shortens the pause length, or does both;
when the elapsed time is equal to or greater than a predetermined second threshold (where the first threshold is less than the second threshold), decreases the speech speed of the output voice of the next block, lengthens the pause length, or does both; and
when the elapsed time is equal to or greater than the first threshold and less than the second threshold, maintains the speech speed or the pause length, or both, of the output voice of the next block.

3. The audio output device according to claim 2, wherein the control unit maintains the speech speed or the pause length, or both, of the output voice of the next block when the voice detection unit detects voice input after the voice output unit starts outputting the output voice and before a predetermined third threshold elapses.

4. The audio output device according to claim 3, wherein the control unit increases the speech speed of the output voice of the next block, shortens the pause length, or does both, when the voice detection unit detects voice input while the voice output unit is still outputting the output voice, after the third threshold has elapsed since the voice output unit started outputting it.

5. The audio output device according to claim 3 or 4, wherein the control unit determines the first threshold, the second threshold, and the third threshold for each block of the output voice based on the speech speed or the pause length, or both, of that block.

6. The audio output device according to claim 1, further comprising a storage unit that stores variable rules for the speech speed or the pause length of the output voice, or both,
wherein the control unit controls the speech speed or the pause length of the output voice, or both, with reference to the variable rules.

7. The audio output device according to any one of claims 1 to 6, wherein the control unit receives from the voice detection unit the time at which voice input started, or receives from the voice output unit the time at which voice output ended, and measures the elapsed time based on these times.

8. The audio output device according to any one of claims 1 to 7, wherein the voice detection unit includes voice recognition means for recognizing the content of the input voice, and determines that voice has been input only when the voice recognition means recognizes that a predetermined reserved word has been input.