JP2006154531A - Audio speed conversion device, audio speed conversion method, and audio speed conversion program

Info

Publication number: JP2006154531A
Application number: JP2004347391A
Authority: JP
Inventors: Meiko Masaki; 芽衣子正木; Masayuki Misaki; 正之三崎
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2004-11-30
Filing date: 2004-11-30
Publication date: 2006-06-15

Abstract

PROBLEM TO BE SOLVED: To provide a device, a method, and a program for speech speed conversion that can accurately calculate the speech speed of a speaker's speech included in a speech signal to convert the speech speed of a reproduced speech that a user desires.

SOLUTION: A pause detection portion 12 decides a speech section and a non-speech section (pause) from a speech signal stored in a speech signal storage portion 11. A statistical data storage portion 13 stores statistical data regarding predetermined pause lengths. A speech speed calculation portion 14 measures pause lengths of respective pauses detected by the pause detection portion 12 and calculates the speech speed based upon the pause lengths and the statistical data stored in the statistical data storage portion 13.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to an audio speed conversion device, an audio speed conversion method, and an audio speed conversion program, and more specifically to an audio speed conversion device, method, and program capable of converting the speaking speed of an audio signal.

Conventionally, methods are known for converting the speed at which a speaker speaks (hereinafter, the speaking speed) at a constant compression/expansion ratio. For example, when the audio signal of a conversation is played back, the speaking speed of the actual speaker can be converted at a constant compression/expansion ratio to change the speaking speed of the reproduced speech contained in the audio signal. In an actual conversation, however, several speakers may each speak at a different speed, and even the same speaker may speak at different speeds. In other words, the actual speaking speed is often not constant. Consequently, when the actual speaking speed varies, reproduced speech converted at a constant compression/expansion ratio may be faster or slower than the speaking speed the user desires. As a result, there are portions where the speaker's voice is difficult for the user to hear.

A method has therefore been proposed that detects the actual speaker's speaking speed and performs speaking-speed conversion with a compression/expansion ratio set according to that speed (see, for example, Patent Document 1). The audio compression/expansion device disclosed in Patent Document 1 is described below with reference to FIG. 17.

In Patent Document 1, the actual speaking speed of a speaker is defined as the number of syllables per unit time and is called the "vocalization speed". Here, a syllable means a unit consisting of a phoneme with a certain duration (for example, a vowel), or a unit in which such a phoneme is preceded and/or followed by a very short phoneme (for example, a consonant).

FIG. 17 is a block diagram showing a configuration in which the audio compression/expansion device is applied to an IC recorder. In FIG. 17, the IC recorder 200 includes a microphone 207, an A/D converter 208, an IC memory 201, a compression/expansion device 206, a D/A converter 205, and a speaker 209.

When recording a speaker's voice, the IC recorder 200 converts the analog audio signal input from the microphone 207 into a digital signal in the A/D converter 208 and records the converted digital audio signal in the IC memory 201. When playing back the speaker's voice, the IC recorder 200 compresses or expands the digital audio signal recorded in the IC memory 201 on the time axis in the compression/expansion device 206. The compressed or expanded digital audio signal is then converted into an analog signal by the D/A converter 205, and the resulting analog audio signal is reproduced from the speaker 209.

The compression/expansion device 206 includes a vocalization speed detection unit 202, a compression/expansion ratio adjustment unit 203, and a pitch expansion/compression unit 204. Typically, the compression/expansion device 206 is implemented by a DSP. The vocalization speed detection unit 202 generates a time-axis waveform of the audio signal from the digital audio signal recorded in the IC memory 201 and applies smoothing to the envelope of that waveform. It then detects, for each predetermined time period, the peak positions of the waveforms that make up the syllables in the smoothed waveform and counts the peaks. The peak count is taken as the number of syllables, and the number of syllables per unit time, obtained by dividing the syllable count by the length of the predetermined period, is calculated as the vocalization speed. Here, a peak is the point of maximum level in the waveform that makes up each syllable.
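The peak-counting procedure of the prior-art detection unit can be illustrated with a minimal Python sketch. The moving-average smoother, the peak criterion, the `threshold` parameter, and all function names are assumptions made for illustration; Patent Document 1 as summarized here does not specify them.

```python
def smooth(envelope, window=3):
    """Moving-average smoothing of an amplitude envelope (assumed smoother)."""
    half = window // 2
    out = []
    for i in range(len(envelope)):
        lo, hi = max(0, i - half), min(len(envelope), i + half + 1)
        out.append(sum(envelope[lo:hi]) / (hi - lo))
    return out


def count_peaks(values, threshold):
    """Count local maxima above a level threshold (assumed peak criterion)."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i] > threshold and values[i - 1] < values[i] >= values[i + 1]
    )


def vocalization_speed(envelope, frame_rate_hz, window=3, threshold=0.1):
    """Syllables per second: peak count divided by the analysed duration."""
    peaks = count_peaks(smooth(envelope, window), threshold)
    duration_s = len(envelope) / frame_rate_hz
    return peaks / duration_s


# Two syllable-like bumps in a one-second envelope sampled at 10 frames/s.
envelope = [0, 0.5, 1, 0.5, 0, 0, 0.5, 1, 0.5, 0]
rate = vocalization_speed(envelope, frame_rate_hz=10)
```

Because every peak above the threshold is counted as a syllable, any superimposed music or noise that adds peaks to the envelope inflates the result, which is the weakness the present invention addresses.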

The compression/expansion ratio adjustment unit 203 adjusts the compression/expansion ratio based on the vocalization speed calculated by the vocalization speed detection unit 202. As an example, a vocalization speed of 8 syllables per second is taken as the reference value for adjustment. When the vocalization speed equals the reference value, the compression/expansion ratio is set so that the thinning rate is 50% for 2x playback and the insertion rate is 50% for 0.5x playback. When the vocalization speed is higher (faster) than the reference value, the thinning rate is set below 50% for 2x playback and the insertion rate is set to 50% or more for 0.5x playback. That is, in each case playback is adjusted to be slower than playback at the reference value. When the vocalization speed is lower (slower) than the reference value, the thinning rate is set to 50% or more for 2x playback and the insertion rate is set below 50% for 0.5x playback. That is, in each case playback is adjusted to be faster than playback at the reference value.

The pitch expansion/compression unit 204 converts the speaking speed of the speaker's voice by compressing or expanding, on the time axis, the digital audio signal recorded in the IC memory 201 based on the compression/expansion ratio adjusted by the compression/expansion ratio adjustment unit 203.

As described above, the audio compression/expansion device 206 calculates the vocalization speed for each predetermined time period from the number of peaks in the time-axis waveform of the speaker's voice. It can then perform speaking-speed conversion with the compression/expansion ratio adjusted according to the calculated vocalization speed.

Patent Document 1: Japanese Patent Laid-Open No. 7-64597

Consider an input audio signal in which a signal other than the speaker's voice (for example, a music or noise signal) is superimposed on the voice to be converted, such as the audio of a television program, a radio program, or a movie recorded on a recording medium. The time-axis waveform of the speaker's voice contained in such a signal is a waveform on which those other signals are superimposed. Consequently, in such an audio signal, the peak positions of the waveforms that make up syllables do not necessarily correspond to the syllable peak positions in the actual speaker's voice. In other words, the waveform peaks do not necessarily correspond to syllables of human speech. The conventional audio compression/expansion device 206, however, detects those peak positions, counts the peaks, and calculates the vocalization speed from the count, so it is difficult to calculate accurately the vocalization speed of the speaker's voice to be converted. As a result, the reproduced speech cannot be converted to the speaking speed the user desires.

An object of the present invention is therefore to provide an audio speed conversion device, an audio speed conversion method, and an audio speed conversion program that can accurately calculate the speaking speed of a speaker's voice, even in an audio signal in which signals other than that voice are superimposed on it, and thereby convert the reproduced speech to the speaking speed the user desires.

A first invention is an audio speed conversion device that converts the speaking speed of an audio signal containing the voice of a speaker to be converted and reproduces it, the device comprising: a non-speech section detection unit that distinguishes, in the audio signal, speech sections containing the speaker's voice from non-speech sections not containing it, and detects the non-speech sections; a non-speech section length measurement unit that measures the time length of each non-speech section detected from the audio signal by the non-speech section detection unit; a speech rate calculation unit that calculates the speaker's speech rate in the audio signal based on the time lengths of the non-speech sections measured by the non-speech section length measurement unit; and a speaking-speed conversion and reproduction unit that converts the speaking speed of the audio signal and reproduces it according to the speech rate calculated by the speech rate calculation unit.
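As a rough illustration of the length measurement in the first invention, the following Python sketch turns a per-frame speech/non-speech decision into a list of non-speech section lengths. How the decision itself is made is not described in this passage, so the boolean input `is_speech`, the frame length `frame_ms`, and the function name are assumptions.

```python
def pause_lengths_ms(is_speech, frame_ms):
    """Return the length in milliseconds of every run of non-speech frames.

    `is_speech` is a sequence of booleans, one per frame, assumed to come
    from some detector that separates speech from non-speech sections.
    """
    lengths = []
    run = 0  # number of consecutive non-speech frames seen so far
    for flag in is_speech:
        if flag:
            if run:
                lengths.append(run * frame_ms)
            run = 0
        else:
            run += 1
    if run:  # signal ended inside a non-speech section
        lengths.append(run * frame_ms)
    return lengths


frames = [True, False, False, True, True, False, True]
pauses = pause_lengths_ms(frames, frame_ms=10)
```

The resulting list is the input that the speech rate calculation unit would work from; leading or trailing silence is counted here like any other run, which a real device might prefer to exclude.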

A second invention is the audio speed conversion device according to the first invention, wherein the speech rate calculation unit includes an occurrence frequency calculation unit that calculates the occurrence frequency of the time lengths of the non-speech sections measured by the non-speech section length measurement unit, and the speech rate calculation unit calculates the speaker's speech rate corresponding to the time length with the highest count in the occurrence frequency, based on a previously obtained statistical relational expression between speech rate and non-speech section time length.

A third invention is the audio speed conversion device according to the second invention, further comprising a statistical data storage unit that stores statistical data on non-speech section time lengths obtained statistically for each of a plurality of preset categories, wherein the speech rate calculation unit includes a non-speech section classification unit that, based on the statistical data stored in the statistical data storage unit, classifies each non-speech section into one of the plurality of categories according to its time length measured by the non-speech section length measurement unit, and selects one category from the plurality of categories based on a predetermined condition, and the occurrence frequency calculation unit calculates the occurrence frequency using the time lengths of the non-speech sections belonging to the category selected by the non-speech section classification unit.

A fourth invention is the audio speed conversion device according to the third invention, wherein the statistical data storage unit sets the plurality of categories according to speech rates classified into a plurality of classes and stores the statistical data for each category, and the non-speech section classification unit, based on the statistical data for each speech rate category, classifies each non-speech section into a category according to its time length measured by the non-speech section length measurement unit, and selects the category into which the most non-speech sections were classified.
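The classify-then-select step of the fourth invention can be sketched in Python as follows. The stored statistics are represented here by a single mean pause length per speech rate category, and each pause is assigned to the nearest mean; both the numbers and the nearest-mean rule are hypothetical, since this passage does not state what the statistical data contains or how it is compared.

```python
from collections import Counter

# Hypothetical statistical data: mean pause length (ms) per speech rate category.
CATEGORY_MEAN_MS = {"fast": 200.0, "normal": 350.0, "slow": 550.0}


def classify(pause_ms, stats=CATEGORY_MEAN_MS):
    """Assign a pause to the category whose mean length is closest (assumed rule)."""
    return min(stats, key=lambda name: abs(pause_ms - stats[name]))


def select_category(pauses_ms, stats=CATEGORY_MEAN_MS):
    """Pick the category holding the most pauses and return it with its pauses."""
    labels = [classify(p, stats) for p in pauses_ms]
    selected = Counter(labels).most_common(1)[0][0]
    kept = [p for p, label in zip(pauses_ms, labels) if label == selected]
    return selected, kept


selected, kept = select_category([180, 220, 240, 360, 600])
```

Only the pauses in `kept` would then be passed on to the occurrence frequency calculation, so isolated long or short pauses that fall into other categories are left out.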

A fifth invention is the audio speed conversion device according to the third invention, wherein the statistical data storage unit stores statistical data for a first category, obtained statistically from the time lengths of non-speech sections occurring immediately after a comma, and statistical data for a second category, obtained statistically from the time lengths of non-speech sections occurring immediately after a period, the non-speech section classification unit, based on the statistical data for each of the first and second categories, classifies each non-speech section into a category according to its time length measured by the non-speech section length measurement unit and selects the first category, and the occurrence frequency calculation unit calculates the occurrence frequency using the time lengths of the non-speech sections belonging to the first category.

A sixth invention is the audio speed conversion device according to the third invention, wherein the statistical data storage unit sets a plurality of categories by combining speech rate categories, classified into a plurality of classes, with a first category for non-speech sections occurring immediately after a comma and a second category for those occurring immediately after a period, and stores statistical data on non-speech section time lengths obtained statistically for each category, and the non-speech section classification unit, based on the statistical data for each of the plurality of categories, classifies each non-speech section into a category according to its time length measured by the non-speech section length measurement unit, determines the speech rate category by extracting, from among the combinations of a speech rate category with the first category, the one into which the most non-speech sections were classified, and selects the category that combines the determined speech rate category with the first category.

A seventh invention is the audio speed conversion device according to the third invention, wherein the statistical data storage unit sets a plurality of categories in advance according to speaker characteristics and stores statistical data on non-speech section time lengths obtained statistically for each of the categories.

An eighth invention is the audio speed conversion device according to the first invention, wherein the speech rate calculation unit includes: a non-speech section classification unit that classifies the non-speech sections detected by the non-speech section detection unit into a plurality of groups using the time lengths measured by the non-speech section length measurement unit, and selects one of the groups; an occurrence frequency calculation unit that calculates the occurrence frequency using the time lengths of the non-speech sections belonging to the group selected by the non-speech section classification unit; and a speech rate conversion unit that calculates the speaker's speech rate corresponding to the time length with the highest count in the occurrence frequency, based on a previously obtained statistical relational expression between speech rate and non-speech section time length.

A ninth invention is the audio speed conversion device according to the first invention, further comprising a display unit, and a playback time calculation unit that calculates the playback time taken when the speaking-speed conversion and reproduction unit converts and reproduces the audio signal, and displays information indicating that playback time on the display unit.

A tenth invention is the audio speed conversion device according to the first invention, further comprising a display unit, and a playback speed calculation unit that calculates the playback speed at which the speaking-speed conversion and reproduction unit converts and reproduces the audio signal, and displays information indicating that playback speed on the display unit.

An eleventh invention is an audio speed conversion method for converting the speaking speed of an audio signal containing the voice of a speaker to be converted and reproducing it, the method comprising: a non-speech section detection step of distinguishing, in the audio signal, speech sections containing the speaker's voice from non-speech sections not containing it, and detecting the non-speech sections; a non-speech section length measurement step of measuring the time length of each non-speech section detected in the non-speech section detection step from a predetermined duration of the audio signal; a speech rate calculation step of calculating the speaker's speech rate in the audio signal based on the time lengths of the non-speech sections measured in the non-speech section length measurement step; and a speaking-speed conversion and reproduction step of converting the speaking speed of the audio signal and reproducing it according to the speech rate calculated in the speech rate calculation step.

A twelfth invention is an audio speed conversion program executed by a computer of an audio speed conversion device that converts the speaking speed of an audio signal containing the voice of a speaker to be converted and reproduces it, the program causing the computer to execute: a non-speech section detection step of distinguishing, in the audio signal, speech sections containing the speaker's voice from non-speech sections not containing it, and detecting the non-speech sections; a non-speech section length measurement step of measuring the time length of each non-speech section detected in the non-speech section detection step from a predetermined duration of the audio signal; a speech rate calculation step of calculating the speaker's speech rate in the audio signal based on the time lengths of the non-speech sections measured in the non-speech section length measurement step; and a speaking-speed conversion and reproduction step of converting the speaking speed of the audio signal and reproducing it according to the speech rate calculated in the speech rate calculation step.

According to the first invention, the speaker's speech rate in the audio signal is calculated from the time lengths of the non-speech sections contained in the audio signal, so the speech rate of the speaker's voice can be calculated accurately even when signals other than that voice are superimposed on the voice to be converted.

According to the second invention, the occurrence frequency of the time lengths of the non-speech sections detected from the audio signal is calculated, and the speech rate is calculated from the time length with the highest count using a previously obtained statistical relational expression, so an accurate speech rate can be calculated even though non-speech section time lengths vary.

According to the third invention, classifying the time lengths of the non-speech sections detected from the audio signal into preset categories makes it possible to exclude irregular time length data and calculate a more accurate speech rate.

According to the fourth invention, each non-speech section detected from the audio signal is classified into speech rate categories (for example, three categories: fast, normal, and slow), and the category containing the most classified non-speech sections is selected, which makes it possible to pick out the non-speech sections suited to calculating an accurate speech rate.

According to the fifth invention, because the time length of a non-speech section occurring immediately after a comma correlates strongly with speech rate, using only the time lengths of the non-speech sections belonging to the first category, which represents non-speech sections immediately after a comma, allows the speech rate to be calculated still more accurately.

According to the sixth invention, the non-speech sections are first classified into speech rate categories using the number belonging to the second category, which represents non-speech sections immediately after a period, exploiting the property that the time length of such sections correlates strongly enough with speech rate to distinguish speech rates roughly. Then, exploiting the strong correlation between speech rate and the time length of non-speech sections occurring immediately after a comma, only the time lengths of the non-speech sections belonging to the first category, which represents non-speech sections immediately after a comma, are used, so the speech rate can be calculated still more accurately.

According to the seventh invention, classifying each non-speech section detected from the audio signal into a plurality of categories corresponding to speaker characteristics makes it possible to calculate the optimum speech rate for the characteristics of the speaker whose voice is to be converted.

According to the eighth invention, a plurality of groups is set, the time length of each non-speech section detected from the audio signal is classified into those groups based on statistical data obtained statistically for each group, and the group containing the most classified non-speech sections is selected, which makes it possible to pick out the non-speech sections appropriate for calculating the speech rate.

According to the ninth invention, once the user inputs the desired speech rate of the reproduced speech, the playback time of the audio signal can be known in advance.

According to the tenth invention, once the user inputs the desired playback time of the audio signal, the speech rate of the reproduced speech after speaking-speed conversion can be known in advance.
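The arithmetic behind the ninth and tenth inventions can be sketched as follows, assuming the whole signal is time-scaled uniformly so that the measured speech rate becomes the desired one. That uniform-scaling model, the function names, and the parameter names are assumptions for illustration; the text here only states which quantity the user inputs and which is displayed.

```python
def playback_time_s(duration_s, measured_rate, desired_rate):
    """Playback time when speech at `measured_rate` is converted to `desired_rate`.

    Rates are in morae per second. Speeding speech up by the factor
    desired_rate / measured_rate shortens the playback time by the same factor.
    """
    return duration_s * measured_rate / desired_rate


def resulting_rate(duration_s, measured_rate, desired_time_s):
    """Speech rate heard when the signal is fitted into `desired_time_s`."""
    return measured_rate * duration_s / desired_time_s


# A 10-minute recording measured at 8 morae/s, played back at 10 morae/s.
time_needed = playback_time_s(600.0, 8.0, 10.0)
rate_heard = resulting_rate(600.0, 8.0, 480.0)
```

The two functions are inverses of each other, which mirrors how the ninth invention goes from a desired rate to a playback time and the tenth from a desired playback time to a rate.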

The audio speed conversion method and the audio speed conversion program of the present invention provide the same effects as the audio speed conversion device described above.

Before describing embodiments of the present invention, the concept of the invention is described with reference to FIGS. 1 to 5. In the present invention, the speed at which speech is uttered (the speaking speed) is defined as the number of morae per unit time, or its reciprocal, and is called the "speech rate". Here, a mora corresponds to one kana character. A section containing the voice of the speaker to be converted is a speech section, and a section not containing that speaker's voice is a non-speech section. A non-speech section is called a "pause".

The following relationships between pauses and speech rate are generally known. First, the total number of pauses within the same text decreases as the speech rate increases. Second, the total time length of the pauses (hereinafter, pause length) within the same text becomes shorter as the speech rate increases. Third, at the same speech rate, pause length differs depending on the phrase attribute of the speech near the pause (for example, the punctuation in the text).

The tendency of pause length across different speech rate categories is reported in Masaki and two others, "On the control of utterance duration between different speaking speeds and utterance styles in story reading," 2002 Spring Meeting of the Acoustical Society of Japan, Proceedings of the Acoustical Society of Japan, Acoustical Society of Japan, March 2002, 2-10-17, pp. 297-298 (hereinafter, Document 1). An outline of Document 1 is given below.

Document 1 records story readings at different speech rate categories (fast, normal, slow) and analyzes the occurrence frequency of pauses by pause length for each category. FIG. 1 shows an example of the occurrence frequency of pauses by pause length for each speech rate category (fast, normal, slow). In FIG. 1, the vertical axis shows the occurrence frequency of pauses as a count, and the horizontal axis shows pause length in milliseconds. In FIG. 1, the pause length at which the occurrence frequency peaks tends to be shorter the faster the speech rate.

Here, attention is paid to the period "。" and the comma "、" in the text, which are phrase attributes of the speech near a pause, and the pauses are divided into those immediately after a period and those immediately after a comma. FIG. 2 is a schematic diagram showing an example of pauses in a time-domain speech waveform. As shown in FIG. 2, the phrase attribute of a period is categorized as "inter-sentence", and a pause immediately after a period is called an "inter-sentence pause". Likewise, the phrase attribute of a comma is categorized as "intra-sentence", and a pause immediately after a comma is called an "intra-sentence pause". The pause-length tendencies of intra-sentence pauses and inter-sentence pauses are then analyzed separately for each speech-rate category.

First, the pause-length tendency of intra-sentence pauses by speech-rate category is described with reference to FIGS. 3 and 4. FIG. 3 shows an example of the occurrence frequency of intra-sentence pauses for each speech-rate category (fast, normal, slow). In FIG. 3, when the speech rate is slow, the peak of the occurrence frequency tends to lie at a longer pause length than for the other speech rates. When the speech rate is fast, the peak tends to lie at a shorter pause length than for the other speech rates.

FIG. 4 shows the intra-sentence pause lengths divided by the respective speech rate, so that pause length is expressed in morae rather than in milliseconds. In FIG. 4, even at different speech rates, the pause length at the peak of the occurrence frequency of intra-sentence pauses coincides at around 2 morae. In other words, intra-sentence pause lengths expressed in morae tend strongly to coincide regardless of speech rate. Consequently, once the pause length of the intra-sentence pauses is determined, the speech rate can be calculated from that pause length.

Next, the pause-length tendency of inter-sentence pauses by speech-rate category is described with reference to FIG. 5. FIG. 5 shows an example of the occurrence frequency of inter-sentence pauses for each speech-rate category (fast, normal, slow). In FIG. 5, when the speech rate is slow, the peak of the occurrence frequency tends to lie at a longer pause length than for the other speech rates. When the speech rate is fast, the peak of the occurrence-frequency distribution tends to lie at a shorter pause length than for the other speech rates.

Here, the difference between the pause lengths at which the occurrence frequency peaks for the respective speech-rate categories is larger for inter-sentence pauses than for intra-sentence pauses (see FIG. 3). For example, comparing the fast and slow cases, the peak pause lengths differ more for inter-sentence pauses. That is, inter-sentence pauses show the dependence of pause length on speech rate more clearly than intra-sentence pauses do, and are therefore suited to roughly classifying the speech rate.

As described above, inter-sentence and intra-sentence pauses, which are distinguished by the phrase attribute of the speech near the pause, have pause lengths that correlate strongly with speech rate, and are useful information for calculating the speech rate. Moreover, because pause length is the duration of a non-speech section, it can be detected accurately even in an audio signal in which signals other than speech are superimposed on the speech. The present invention therefore focuses on pauses categorized by the phrase attribute of the nearby speech, and calculates the speech rate from the pause lengths of those pauses.

(First Embodiment)
A first embodiment of the present invention is described below with reference to FIG. 6. FIG. 6 is a block diagram showing the audio speed conversion device 1 according to the first embodiment of the present invention. In FIG. 6, the audio speed conversion device 1 includes an audio signal storage unit 11, a pause detection unit 12, a statistical data storage unit 13, a speech rate calculation unit 14, a speech rate control unit 15, a speed conversion unit 16, and a speaker 17.

The audio signal storage unit 11 stores an audio signal. Here, the audio signal means a signal containing the speech of the speaker whose speech rate the user wishes to convert (the target speaker). The audio signal need only contain at least the target speaker's speech. That is, it may consist solely of the target speaker's speech (for example, a conversation), or it may be a signal in which other signals are superimposed on that speech (for example, the audio of a television program, a radio program, or a movie recorded on a recording medium).

The target speaker is not limited to a single speaker and may be a plurality of speakers. A conversation, for example, consists of the speech of several speakers. The speech of each of these speakers is determined to be a speech section by the pause detection unit 12 described later.

The audio signal may be supplied to the audio signal storage unit 11 via a communication medium, for example. Alternatively, an audio signal recorded on a recording medium (for example, an optical disc) may be supplied to the audio signal storage unit 11.

The pause detection unit 12 determines speech sections and non-speech sections in the audio signal stored in the audio signal storage unit 11, and detects each non-speech section so determined as a pause. The statistical data storage unit 13 stores statistical data on pause length obtained in advance. The speech rate calculation unit 14 calculates the speech rate from the pauses detected by the pause detection unit 12 and the statistical data stored in the statistical data storage unit 13. The speech rate control unit 15 calculates a compression/expansion ratio by a preset control method according to the speech rate calculated by the speech rate calculation unit 14. The speed conversion unit 16 performs speech-rate conversion by compressing or expanding, on the time axis, the audio signal stored in the audio signal storage unit 11 according to the compression/expansion ratio calculated by the speech rate control unit 15. The converted audio signal is then reproduced from the speaker 17. The function of each component of the audio speed conversion device 1 is described in detail below.

The pause detection unit 12 divides the audio signal stored in the audio signal storage unit 11 into frames set as the unit of detection (the duration of one frame is Tf), and determines speech sections and non-speech sections frame by frame. It then detects each non-speech section so determined as a pause. FIG. 7 shows an example of the time-domain waveform of one frame of the audio signal. In FIG. 7, the vertical axis shows level and the horizontal axis shows time. In the waveform of FIG. 7, the speech sections determined by the pause detection unit 12 are denoted To1 to To6, and the non-speech sections are denoted Tp1 to Tp5.

As a method by which the pause detection unit 12 can determine speech and non-speech sections, a method is known that uses, for example, the Bayes function described in Document 2 to divide an audio signal into sections of the target speaker's speech and sections of signals other than that speech (Yasuyuki Nakajima and four others, "Audio indexing from MPEG encoded data" (title translated), IEICE Transactions, IEICE, May 2000, D-II Vol. J83-D-II No. 5, pp. 1361-1371; hereinafter, Document 2). In the present embodiment, the pause detection unit 12 uses the method of Document 2 to determine, from the time-domain waveform of the audio signal, the sections of the target speaker's speech as speech sections (for example, speech sections To1 to To6 in FIG. 7) and the sections of signals other than that speech as non-speech sections (for example, non-speech sections Tp1 to Tp5 in FIG. 7). It then detects each non-speech section so determined as a pause.

The method of determining speech and non-speech sections is not limited to that of Document 2, and other methods may be used. As one example, signals such as television programs and movies contain a video signal in addition to the audio signal. In that case, the video signal is used to recognize image information on the vocal organs that move when the target speaker speaks (for example, the target speaker's lips, jaw, and vocal cords). The recognition result is used to judge whether the target speaker is speaking; sections in which the speaker is speaking are determined to be speech sections and all other sections non-speech sections, and pauses are detected in this way.

The statistical data storage unit 13 stores pause-length statistics, obtained in advance, for each attribute combination that a pause can have (hereinafter, class). FIG. 8 shows an example of the pause-length statistics for each class. A class consists of two attributes, a speech-rate attribute and a phrase attribute. The speech-rate attribute is an attribute of the speech rate, categorized as fast, normal, slow, and so on. The phrase attribute is an attribute of the phrasing, categorized as inter-sentence, intra-sentence, and so on. A class is thus a combination of one speech-rate category and one phrase category. Specifically, in FIG. 8, the combination of the phrase category "intra-sentence" and the speech-rate category "fast", for example, is one class, and there are 3 (speech-rate attribute) × 2 (phrase attribute) = 6 classes in total. Pause-length statistics are set for each class. For example, the statistical data consists of the mean and standard deviation of the pause length, obtained statistically in advance for each class.

Separate sets of statistical data may be prepared according to the speaker's speaking style and characteristics, such as whether the speech is in a read-aloud style or a casual style. Separate sets may also be prepared for speaker information and for each television program genre, such as sports programs or dramas. In that case, the user may select the statistical data to suit the content of the audio signal to be converted, or the most suitable statistical data may be selected automatically based on EPG information of a television broadcast or the like.

For the N frames (N is a natural number) consisting of the current frame detected by the pause detection unit 12 and the frames preceding it, the speech rate calculation unit 14 calculates the speech rate of those N frames from the pause length of each pause they contain and the statistical data stored in the statistical data storage unit 13, and takes that value as the speech rate of the current frame. Here, "N frames" means N of the frames described above, and the pauses contained in the N frames are called a pause sequence. For example, when N = 5, the pauses contained in the five frames made up of the current frame and the four frames before it form the pause sequence. When N = 1, the pauses contained in the current frame alone form the pause sequence.

The duration Tfn of the pause sequence is Tfn = Tf × N, because it spans N frames of duration Tf. For example, if Tfn = 10 seconds and N = 10, the frame duration is Tf = 1 second; the speech rate is calculated from the pause lengths of the pauses contained in the 10 seconds (Tfn) and is taken as the speech rate of the current 1 second (Tf). The durations given here are examples and are not limiting. The shorter the frame duration (Tf), the shorter the time span to which each calculated speech rate applies. The longer the pause sequence duration (Tfn), the more pauses the sequence contains and the more accurately the speech rate can be calculated; if Tfn is too long, however, the response to changes in the actual speaker's speech rate lags and the processing load on the device increases. Tfn and Tf are set appropriately with these characteristics in mind. FIG. 9 is a block diagram showing the configuration of the speech rate calculation unit 14. The speech rate calculation unit 14 includes a pause length measurement unit 141, a pause classification unit 142, a pause frequency calculation unit 143, and a speech rate conversion unit 144.
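The relation Tfn = Tf × N and the sliding window over the latest N frames can be sketched as follows. This is only an illustration: the class name `PauseWindow`, its method, and the use of milliseconds are assumptions, and the handling of pauses that straddle a frame boundary, which the text does not discuss, is left out.

```python
from collections import deque


def sequence_length_sec(frame_sec: float, n_frames: int) -> float:
    """Duration of the pause sequence: Tfn = Tf * N."""
    return frame_sec * n_frames


class PauseWindow:
    """Holds the pauses of the current frame and the N-1 frames before it.

    Hypothetical helper, not a component named in the source.
    """

    def __init__(self, n_frames: int) -> None:
        # A bounded deque drops the oldest frame automatically.
        self._frames = deque(maxlen=n_frames)

    def push_frame(self, pause_lengths_ms):
        """Add the pauses of the newest frame and return the pause sequence."""
        self._frames.append(list(pause_lengths_ms))
        return [p for frame in self._frames for p in frame]
```

With Tf = 1 second and N = 10, `sequence_length_sec(1.0, 10)` gives the 10-second span of the example in the text, and each call to `push_frame` returns the pause lengths from which the speech rate of the newest frame would be calculated.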

In FIG. 9, the pause length measurement unit 141 measures the pause length of each pause within the N frames consisting of the current frame detected by the pause detection unit 12 and the frames preceding it.

The pause classification unit 142 classifies each pause into a class based on the pause lengths of the pause sequence measured by the pause length measurement unit 141. FIG. 10 is a block diagram showing the configuration of the pause classification unit 142. The pause classification unit 142 includes a class identification unit 145, a pause sequence determination unit 146, and a class determination unit 147.

The class identification unit 145 identifies the class to which each pause in the pause sequence belongs (hereinafter, assigned class) from the pause lengths of the pause sequence measured by the pause length measurement unit 141 and the statistical data stored in the statistical data storage unit 13. Specifically, the degree to which a pause fits a class (hereinafter, degree of fit L) is calculated for every class, for each pause. The degree of fit L is calculated by the following equations.
L = 1 / d ... (1)
d = |x − a| / S ... (2)
Here, d is what is generally called the Mahalanobis distance, S is the standard deviation of the class, and a is the mean of the class. x is the pause length of the pause whose fit is being calculated. From the results of these equations, the class with the largest degree of fit L is identified as the assigned class of the pause.
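Equations (1) and (2) and the selection of the class with the largest L can be sketched as below. The statistics in `STATS` are placeholder numbers chosen for illustration only; they are not the values of FIG. 8, which the text does not reproduce. Returning infinity when x equals the class mean (d = 0) is also an assumption, since the text does not cover that case.

```python
import math

# Hypothetical (mean, standard deviation) of pause length in milliseconds
# for each of the six classes. Placeholder values, NOT those of FIG. 8.
STATS = {
    ("intra", "fast"): (180.0, 30.0),
    ("intra", "normal"): (260.0, 40.0),
    ("intra", "slow"): (330.0, 50.0),
    ("inter", "fast"): (500.0, 100.0),
    ("inter", "normal"): (800.0, 150.0),
    ("inter", "slow"): (1200.0, 200.0),
}


def fit_degree(x, mean, std):
    """L = 1 / d with d = |x - a| / S, per equations (1) and (2)."""
    d = abs(x - mean) / std
    # Assumption: a pause exactly at the class mean fits perfectly.
    return math.inf if d == 0 else 1.0 / d


def identify_class(x, stats=STATS):
    """Return the (phrase, speed) class with the largest degree of fit L."""
    return max(stats, key=lambda cls: fit_degree(x, *stats[cls]))
```

Because L is the reciprocal of a distance scaled by the standard deviation, a class with a wide spread can win over a class whose mean is numerically closer, which is the point of normalizing by S.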

The pause sequence determination unit 146 checks whether the speech-rate category of the assigned classes identified by the class identification unit 145 is the same for all pauses in the pause sequence. If the speech-rate categories all agree, that category is determined to be the speech-rate category of the pause sequence. If some of them disagree, the most frequent speech-rate category among the pauses in the sequence whose phrase category is inter-sentence is determined to be the speech-rate category of the pause sequence. Inter-sentence pauses are used for this determination because the dependence of pause length on speech-rate category appears more clearly in inter-sentence pauses than in intra-sentence pauses (see FIG. 5).
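The determination rule for the pause sequence can be sketched as follows. The class labels are hypothetical tuples of the form (phrase, speed). Two cases that the text leaves open are handled by assumption: when the categories disagree and the sequence contains no inter-sentence pause, the function returns `None`, and when two categories are equally frequent, the one seen first wins.

```python
from collections import Counter


def sequence_speed(classes):
    """Decide the speech-rate category of a pause sequence.

    classes: one (phrase, speed) tuple per pause, with phrase being
    "inter" or "intra" (hypothetical labels).
    """
    speeds = {speed for _, speed in classes}
    if len(speeds) == 1:
        # All pauses agree: that category is the result.
        return next(iter(speeds))
    # Otherwise take the majority among inter-sentence pauses only.
    inter = [speed for phrase, speed in classes if phrase == "inter"]
    if not inter:
        return None  # assumption: this case is not covered by the text
    return Counter(inter).most_common(1)[0][0]
```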

For each pause whose speech-rate category as identified by the class identification unit 145 does not match the speech-rate category determined by the pause sequence determination unit 146, the class determination unit 147 re-determines the assigned class. First, the speech-rate category of the mismatched pause is changed to the category determined by the pause sequence determination unit 146. Next, the values of the degree of fit L are compared between the phrase categories (inter-sentence and intra-sentence) under the changed speech-rate category. The phrase category of the class with the larger L is then determined to be the phrase category of the mismatched pause. The assigned class of the mismatched pause is determined in this way. By using the pause sequence in this determination, the assigned class can be determined accurately even for pauses that are difficult to distinguish by duration alone, for example whether a pause belongs to the class "fast, inter-sentence" or the class "slow, intra-sentence". As described above, the pause classification unit 142 classifies each pause into a class based on the pause lengths of the pause sequence measured by the pause length measurement unit 141.
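The re-determination step can be sketched as follows, again with placeholder statistics that are not the values of FIG. 8. The example reproduces the ambiguity mentioned in the text: a 400 ms pause that looks like "fast, inter-sentence" on its own is reassigned to "slow, intra-sentence" once the sequence is known to be slow.

```python
import math

# Hypothetical (mean, standard deviation) in milliseconds; NOT FIG. 8 values.
STATS = {
    ("intra", "fast"): (180.0, 30.0),
    ("intra", "normal"): (260.0, 40.0),
    ("intra", "slow"): (330.0, 50.0),
    ("inter", "fast"): (500.0, 100.0),
    ("inter", "normal"): (800.0, 150.0),
    ("inter", "slow"): (1200.0, 200.0),
}


def fit_degree(x, mean, std):
    """L = 1 / d with d = |x - a| / S."""
    d = abs(x - mean) / std
    return math.inf if d == 0 else 1.0 / d


def redetermine(pause_ms, current_class, seq_speed, stats=STATS):
    """Re-determine the class of a pause against the sequence's category."""
    _, speed = current_class
    if speed == seq_speed:
        return current_class  # already consistent, nothing to change
    # Fix the speech-rate category, then pick the better-fitting phrase category.
    candidates = [("inter", seq_speed), ("intra", seq_speed)]
    return max(candidates, key=lambda cls: fit_degree(pause_ms, *stats[cls]))
```

For example, `redetermine(400.0, ("inter", "fast"), "slow")` compares only the two slow classes and returns `("intra", "slow")` under these placeholder statistics.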

The pause frequency calculation unit 143 calculates, by pause length, the occurrence frequency of those pauses in the pause sequence classified by the pause classification unit 142 whose phrase category is intra-sentence. In doing so, it selects the number of bins and the bin width used to tally the occurrence frequency of intra-sentence pause lengths according to the speech-rate category determined by the pause sequence determination unit 146 (for example, one of the three categories fast, normal, and slow). Here, the number of bins is the number of intervals (bins) into which a given range is divided, and the width of each interval is called the bin width.

In the present embodiment, the number of bins used to tally the occurrence frequency of intra-sentence pause lengths is set as follows. For example, for fast speech, 8 bins are used over the range from 160 ms up to but not including 220 ms; for normal speech, 6 bins over the range from 230 ms up to but not including 290 ms; and for slow speech, 4 bins over the range from 300 ms up to but not including 360 ms.
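The three tallying ranges can be sketched as a small histogram routine. The ranges and counts come from the text; equal-width histogram bins (the tally classes of the text) and the decision to ignore pauses that fall outside the selected range are assumptions.

```python
# (lower bound ms, upper bound ms, number of bins) per speech-rate category.
BINNING = {
    "fast": (160.0, 220.0, 8),
    "normal": (230.0, 290.0, 6),
    "slow": (300.0, 360.0, 4),
}


def bin_width(speed):
    """Width of one bin, assuming the range is divided equally."""
    lo, hi, n = BINNING[speed]
    return (hi - lo) / n


def histogram(pause_lengths_ms, speed):
    """Count intra-sentence pauses per bin for the given category."""
    lo, hi, n = BINNING[speed]
    width = (hi - lo) / n
    counts = [0] * n
    for p in pause_lengths_ms:
        if lo <= p < hi:  # lower bound inclusive, upper bound exclusive
            index = min(int((p - lo) // width), n - 1)
            counts[index] += 1
    return counts
```

Under these assumptions the bin widths are 7.5 ms, 10 ms, and 15 ms for fast, normal, and slow speech respectively, so faster speech is tallied at a finer resolution.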

The reason for these ranges is as follows. The speech rate of an announcer's speech is around 80 ms per mora at the fast end. At the slow end, it is difficult to maintain natural prosody beyond 180 ms per mora. Thus, in a speaking style in which a single speaker reads text aloud, as an announcer does, the speech rate lies between 80 ms/mora and 180 ms/mora. As shown in FIG. 4, the pause length of an intra-sentence pause is statistically about 2 morae. Using the statistical relation that an intra-sentence pause is 2 morae long, the pauses usable for conversion into a speech rate therefore range from 160 ms to 360 ms. The time ranges, numbers of bins, and bin widths given above for tallying the occurrence frequency of intra-sentence pauses are examples and may be set as appropriate.
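The arithmetic behind the 160 ms to 360 ms range is simply the 2-mora relation applied to the two speech-rate limits. A minimal sketch, with hypothetical function names:

```python
MORA_PER_INTRA_PAUSE = 2.0  # statistical relation read from FIG. 4


def pause_range_ms(min_ms_per_mora, max_ms_per_mora, mora=MORA_PER_INTRA_PAUSE):
    """Intra-sentence pause lengths expected for a range of speech rates."""
    return (min_ms_per_mora * mora, max_ms_per_mora * mora)


def ms_per_mora(pause_ms, mora=MORA_PER_INTRA_PAUSE):
    """Speech rate (ms per mora) implied by an intra-sentence pause length."""
    return pause_ms / mora
```

`pause_range_ms(80.0, 180.0)` yields the pair (160.0, 360.0), and the inverse function `ms_per_mora` is the conversion applied to the peak pause length.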

The speech rate conversion unit 144 finds the pause length with the highest count among the occurrence frequencies of intra-sentence pauses calculated by the pause frequency calculation unit 143, and converts that pause length into a speech rate. For the conversion, the midpoint of the bin containing the pause length with the highest count is converted into a speech rate using the statistical relation described above (a length of 2 morae). When several bins have the same occurrence frequency and are adjacent, the mean of the pause lengths belonging to those bins is used. When the bins with the same occurrence frequency are not adjacent, the occurrence frequencies of each such bin and its neighboring bins are summed, the sums are compared, and the bin with the largest sum is adopted.
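The peak selection and its tie rules can be sketched as below. Several points are assumptions: ties are taken to mean ties for the highest count, "neighboring bins" is read as the immediately adjacent bin on each side, a tie set that is only partly adjacent is treated as non-adjacent, an empty histogram returns `None`, and `counts` is assumed to have been tallied from `pauses_ms`.

```python
def peak_pause_ms(counts, lo_ms, width_ms, pauses_ms):
    """Representative pause length at the histogram peak."""
    if not counts or max(counts) == 0:
        return None  # assumption: no intra-sentence pauses, no estimate
    peak = max(counts)
    tied = [i for i, c in enumerate(counts) if c == peak]
    if len(tied) == 1:
        # Single peak: use the midpoint of that bin.
        return lo_ms + (tied[0] + 0.5) * width_ms
    if all(b - a == 1 for a, b in zip(tied, tied[1:])):
        # Adjacent tied bins: mean of the pause lengths inside them.
        lo_t = lo_ms + tied[0] * width_ms
        hi_t = lo_ms + (tied[-1] + 1) * width_ms
        members = [p for p in pauses_ms if lo_t <= p < hi_t]
        return sum(members) / len(members)
    # Non-adjacent ties: prefer the bin with the heaviest neighborhood.
    best = max(tied, key=lambda i: sum(counts[max(i - 1, 0):i + 2]))
    return lo_ms + (best + 0.5) * width_ms


def ms_per_mora(pause_ms, mora=2.0):
    """Convert the peak pause length to a speech rate via the 2-mora relation."""
    return pause_ms / mora
```

For the slow category (range starting at 300 ms, bin width 15 ms), counts of [2, 1, 0, 1] put the peak in the first bin, whose midpoint is 307.5 ms, which corresponds to 153.75 ms per mora.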

In the description above, the statistical relation takes the pause length of an intra-sentence pause to be 2 morae, but the relation is not limited to this value. For example, an optimum value may be set for each target speaker, which allows conversion into a more accurate speech rate. As described above, the speech rate calculation unit 14 calculates a speech rate for each pause sequence, and takes the speech rate of the pause sequence as the speech rate of the current frame within that sequence.

The flow of processing from detecting pauses in the audio signal to calculating the speech rate from their pause lengths is now described with reference to FIGS. 11 to 13. FIG. 11 is a flowchart showing the processing up to the calculation of the speech rate from the pause lengths. FIG. 12 shows an example of the calculated degree of fit L for each pause in the pause sequence. In FIG. 12, pauses Tp1 to Tp5 correspond to pauses Tp1 to Tp5 in FIG. 7. The results in FIG. 12 were calculated using the statistical data values shown in FIG. 8; only the current frame is shown, and the remaining frames of the N frames are omitted. FIG. 13 shows the results of steps S3 to S6 described later. To make the following description concrete, the speech rate of the speaker's speech contained in the audio signal stored in the audio signal storage unit 11 is assumed to be slow, and the "correct" column in FIG. 13 shows the assigned class of each pause under that speech rate (slow).

First, the pause detection unit 12 (see FIG. 6) determines speech sections and non-speech sections frame by frame and detects each non-speech section as a pause (step S1). Next, the pause length measurement unit 141 measures the pause length of each pause (the pause sequence) within the N frames consisting of the current frame detected by the pause detection unit 12 and the frames preceding it (step S2), and processing proceeds to the next step.

Next, the class identification unit 145 calculates the degree of fit L for each pause with respect to every class, by equations (1) and (2) above, using the pause lengths of the pause sequence measured by the pause length measurement unit 141 and the statistical data stored in the statistical data storage unit 13. Then, for each pause in the pause sequence, it identifies the class with the largest degree of fit L as the assigned class of that pause (step S3), and processing proceeds to the next step. In the calculation results of FIG. 12, for example, pause Tp1 has its largest degree of fit, 1.2, for the class "inter-sentence, slow", so the assigned class of pause Tp1 is identified as "inter-sentence, slow". The assigned class of each pause in the pause sequence is identified in this way. The identification results for pauses Tp1 to Tp5 are as shown in FIG. 13.

Next, the pause sequence determination unit 146 determines whether the speech rate attribute is the same across the classes that the class identification unit 145 identified for all pauses in the pause sequence (step S4). If the speech rate attribute is the same for all pauses, the pause sequence determination unit 146 takes that category as the speech rate category of the pause sequence, and processing proceeds to step S7. If it differs for some pauses, processing proceeds to step S5.

In step S5, the class determination unit 147 considers only the pauses in the pause sequence whose phrase attribute is "between sentences" and takes the speech rate attribute that occurs most often among them as the speech rate category of the pause sequence. The class determination unit 147 then re-determines the class of each pause whose speech rate attribute identified in step S3 does not match the speech rate category determined in step S5 (step S6), and processing proceeds to step S7. Specifically, the speech rate attribute of each non-matching pause is changed to the speech rate category determined in step S5. The degrees of fit L of the two phrase attributes (between sentences and within sentence) under the changed speech rate category are then compared, and the phrase attribute of the class with the larger L becomes the phrase category of that pause. The class of each non-matching pause is determined in this way.

The flow of steps S3 to S6 is now described concretely, taking as an example the identification results for a pause sequence with N = 1. In the one-frame results shown in FIG. 13, four pauses have the speech rate attribute "slow" and one has "fast", so the speech rate attributes do not all match (step S4). In step S5, among the pauses whose phrase attribute is "between sentences", three are "slow" and one is "fast". Because "slow" is the most frequent, the speech rate category of the pauses in the frame is determined to be "slow".

Because the speech rate category was determined to be "slow" in step S5, pause Tp4, whose speech rate attribute was identified as "fast" in step S3, does not match. The speech rate attribute of Tp4 is therefore changed to "slow", the category determined in step S5. Then, in FIG. 12, the degrees of fit L of Tp4 are compared between the phrase attributes "between sentences" and "within sentence" under the speech rate attribute "slow". The degree of fit L is 1.11 for the class "slow, within sentence" and 0.41 for the class "slow, between sentences". Because L is larger for "slow, within sentence", the phrase category of Tp4 is determined to be "within sentence". The classes thus determined for the pauses in the pause sequence match the classes based on the speaker's actual speech rate (the "correct answer" column of FIG. 13). In this way, the classes of all pauses in the pause sequence can be made to match the classes based on the speaker's actual speech rate.
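
Steps S3 to S6 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the degree-of-fit table is hypothetical except for the three values quoted in the text (1.2 for Tp1, and 1.11 and 0.41 for Tp4), the "normal" speech rate category is left out for brevity, and the function and variable names are invented here.

```python
from collections import Counter

# fit[pause][(phrase, rate)] = degree of fit L.
# Only Tp1's 1.2 and Tp4's 1.11 / 0.41 come from the text; the rest is invented.
fit = {
    "Tp1": {("between", "slow"): 1.2,  ("between", "fast"): 0.3, ("within", "slow"): 0.1,  ("within", "fast"): 0.05},
    "Tp2": {("between", "slow"): 0.9,  ("between", "fast"): 0.4, ("within", "slow"): 0.2,  ("within", "fast"): 0.1},
    "Tp3": {("between", "slow"): 0.2,  ("between", "fast"): 0.1, ("within", "slow"): 1.0,  ("within", "fast"): 0.6},
    "Tp4": {("between", "slow"): 0.41, ("between", "fast"): 1.3, ("within", "slow"): 1.11, ("within", "fast"): 0.7},
    "Tp5": {("between", "slow"): 1.0,  ("between", "fast"): 0.5, ("within", "slow"): 0.3,  ("within", "fast"): 0.2},
}

def classify_pauses(fit):
    """Return ({pause: (phrase, rate)}, speech rate category of the sequence)."""
    # S3: each pause takes the class with the largest degree of fit L.
    classes = {p: max(scores, key=scores.get) for p, scores in fit.items()}
    # S4: if every pause has the same speech rate attribute, we are done.
    rates = {rate for _, rate in classes.values()}
    if len(rates) == 1:
        return classes, rates.pop()
    # S5: majority vote over between-sentence pauses only.
    # (Assumes at least one such pause; the text does not cover the empty case.)
    votes = Counter(rate for phrase, rate in classes.values() if phrase == "between")
    seq_rate = votes.most_common(1)[0][0]
    # S6: for non-matching pauses, force the rate and re-pick the phrase attribute.
    for p, (_, rate) in list(classes.items()):
        if rate != seq_rate:
            phrase = max(("between", "within"), key=lambda ph: fit[p][(ph, seq_rate)])
            classes[p] = (phrase, seq_rate)
    return classes, seq_rate

classes, seq_rate = classify_pauses(fit)
```

With this table Tp4 starts as "between sentences, fast" and ends as "within sentence, slow", mirroring the worked example.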

In step S7, the pause frequency calculation unit 143 takes the pauses whose phrase category was determined to be "within sentence" among the classes determined in step S4 or step S6, and calculates their occurrence frequency for each pause length. The speech rate conversion unit 144 then finds the pause length with the highest frequency among the within-sentence pauses counted in step S7 and converts it to a speech rate using a statistical relational expression (step S8). A speech rate is thus calculated for each pause sequence and is used as the speech rate of the current frame of that pause sequence. To continue, processing returns to step S1; to finish, the processing of the flowchart ends (step S9). This concludes the description of the flow of detecting pauses and calculating the speech rate from their pause lengths.
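
Steps S7 and S8 amount to taking the mode of a histogram and mapping it through a fitted relation. The sketch below assumes 50 ms histogram bins and, because the statistical relational expression is not given in this passage, substitutes a made-up linear function; its coefficients and even its direction are placeholders, not values from the patent.

```python
from collections import Counter

def modal_pause_length_ms(within_pause_ms, bin_ms=50):
    """Step S7: histogram within-sentence pause lengths; return the centre of the fullest bin."""
    if not within_pause_ms:
        raise ValueError("no within-sentence pauses in this pause sequence")
    counts = Counter(int(length // bin_ms) for length in within_pause_ms)
    # Ties go to the shorter bin; the text does not say how ties are resolved.
    best_bin = min(counts, key=lambda b: (-counts[b], b))
    return (best_bin + 0.5) * bin_ms

def pause_to_speech_rate(pause_ms, intercept=11.0, slope=-0.01):
    """Step S8 stand-in: hypothetical linear relation giving mora/second."""
    return intercept + slope * pause_ms

mode_ms = modal_pause_length_ms([210, 230, 240, 400, 120])
rate = pause_to_speech_rate(mode_ms)
```

In practice the relation would be fitted to measured speech data, and the bin width would be chosen to match the resolution of the stored statistics.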

Returning to FIG. 6, the speech rate control unit 15 calculates a compression/expansion ratio according to a predetermined control method. A control method for playing back the speaker's voice in the audio signal at the speech rate desired by the user is set in the speech rate control unit 15 in advance.

As a specific example, news is read at an average of 9.5 mora/second. One control method therefore targets a speech rate of 8 mora/second for the current frame. If the calculated speech rate of the pause sequence is faster than 8 mora/second, the compression/expansion ratio is controlled so that the signal is expanded along the time axis; if it is slower than 8 mora/second, the ratio is controlled so that the signal is compressed along the time axis. In both cases the speech rate of the current frame is brought to 8 mora/second.
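
The arithmetic behind this control method can be sketched as follows. The text does not define the compression/expansion ratio numerically, so this example assumes it means output duration divided by input duration; under that assumption a ratio above 1 expands the signal and a ratio below 1 compresses it. The function name is hypothetical.

```python
def time_scale_ratio(measured_rate, target_rate=8.0):
    """Ratio of output duration to input duration (assumed definition).

    measured_rate: speech rate calculated for the pause sequence, in mora/second.
    target_rate:   desired speech rate after conversion, in mora/second.
    """
    if measured_rate <= 0 or target_rate <= 0:
        raise ValueError("speech rates must be positive")
    # The same number of mora must fit in the new duration:
    # measured_rate * t_in == target_rate * t_out.
    return measured_rate / target_rate

news_ratio = time_scale_ratio(9.5)   # faster than target, so the signal is expanded
slow_ratio = time_scale_ratio(6.0)   # slower than target, so the signal is compressed
```

For speech at the 9.5 mora/second news average, this gives a ratio of about 1.19, meaning each second of input would occupy about 1.19 seconds after conversion.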

The beginning of a sentence often marks a change of topic, and a listener who misses it may find the following words hard to follow. A control method may therefore be set in advance that controls the compression/expansion ratio so that, after conversion, the words following a between-sentence pause are spoken more slowly than the words following other pauses.

The speed conversion unit 16 compresses or expands the audio signal stored in the audio signal storage unit 11 along the time axis according to the compression/expansion ratio calculated by the speech rate control unit 15, thereby converting the speaker's voice to the speed desired by the user, anywhere from slow to fast. One way to compress and expand the audio signal is the audio speed conversion method disclosed in Japanese Patent No. 3156020, but other methods may also be used.

As described above, the audio speed conversion device of the first embodiment detects pauses in the audio signal and can accurately calculate the speech rate from their pause lengths. As a result, the playback speech can be converted to the speech rate desired by the user in accordance with the speech rate of the speaker's voice.

In the above description the speech rate attribute has three categories, "fast, normal, slow", but two categories, "fast, slow", may be used instead. Four or more categories, for example the five categories "fast, somewhat fast, normal, somewhat slow, slow", may of course also be used. In that case each pause in the pause sequence is assigned to a class according to those categories, and the speech rate of the speaker's voice can be calculated with a resolution that corresponds to the number of categories.

The audio speed conversion device 1 according to this embodiment may also be realized by having a general-purpose computer system execute an audio speed conversion program. FIG. 14 is a block diagram showing a configuration example in which the audio speed conversion device 1 is realized by a computer system 2. In FIG. 14, the audio signal storage unit 11, pause detection unit 12, statistical data storage unit 13, speech rate calculation unit 14, speech rate control unit 15, speed conversion unit 16, and speaker 17 have the same functions as the corresponding components in FIG. 6, so their description is omitted.

In FIG. 14, the computer system 2 includes a CPU 21, a storage unit 22, and a disk drive device 23. By executing the audio speed conversion program, the CPU 21 realizes the same functions as the pause detection unit 12, speech rate calculation unit 14, speech rate control unit 15, and speed conversion unit 16 described above. The storage unit 22 consists of a recording medium such as a hard disk and, when the audio speed conversion program is executed, realizes the same functions as the audio signal storage unit 11 and statistical data storage unit 13 described above. The disk drive device 23 reads the audio speed conversion program, which causes the computer system 2 to function as an audio speed conversion device, from the recording medium 24 on which it is stored. Installing the program on any computer system 2 allows that computer system to function as the audio speed conversion device described above. The playback speech converted by the computer system 2 is then played through the speaker 17. The recording medium 24 is a medium in a format readable by the disk drive device 23, such as a flexible disk or an optical disc. The audio speed conversion program may also be installed on the computer system 2 in advance. The speaker 17 may be built into the computer system 2 or may be external to it.

In the above description the audio speed conversion program is provided on the recording medium 24, but it may also be provided over a telecommunication line such as the Internet. All or part of the processing in the audio speed conversion device may also be carried out by hardware such as a dedicated speech-speed conversion device.

(Second Embodiment)
Next, a second embodiment of the present invention will be described with reference to FIG. 15. FIG. 15 is a block diagram showing an audio speed conversion device 3 according to the second embodiment of the present invention. In FIG. 15, the audio speed conversion device 3 includes an audio signal storage unit 11, a pause detection unit 12, a statistical data storage unit 13, a speech rate calculation unit 14, a speech rate control unit 15, a speed conversion unit 16, a speaker 17, a speech rate storage unit 31, a speed input unit 32, a playback time calculation unit 33, and a display unit 34. The audio signal storage unit 11, pause detection unit 12, statistical data storage unit 13, speech rate calculation unit 14, speech rate control unit 15, speed conversion unit 16, and speaker 17 have the same functions as in the audio speed conversion device 1 described in the first embodiment, so they are given the same reference numerals and their detailed description is omitted.

First, as described in the first embodiment, the pause detection unit 12 detects pauses in the audio signal stored in the audio signal storage unit 11, and the speech rate calculation unit 14 calculates the speech rate of each frame from the pause lengths of those pauses and the statistical data stored in the statistical data storage unit 13.

The speech rate storage unit 31 stores the per-frame speech rates calculated by the speech rate calculation unit 14 for all of the data stored in the audio signal storage unit 11. The speed input unit 32 displays on a screen, with an indicator or the like, the speech rate that the user wants for the playback speech after speech-speed conversion. The user then selects or enters that speech rate with an input device (not shown).

The speech rate control unit 15 calculates the compression/expansion ratio by a predetermined control method from the post-conversion speech rate entered at the speed input unit 32 and the speech rate of each frame stored in the speech rate storage unit 31. As the predetermined control method, for example, a method is set that brings the speech rate of each frame close to the speech rate entered at the speed input unit 32.

The playback time calculation unit 33 calculates the playback time of the entire audio signal after speech-speed conversion from the compression/expansion ratios that the speech rate control unit 15 calculated for all of the audio signal data, and displays it on the display unit 34.
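
The playback time estimate can be illustrated with a short calculation. The sketch assumes equal-length frames, a ratio defined as output duration over input duration, and a control method that brings every frame exactly to the requested speech rate; none of these details is fixed by the text, and the names are hypothetical.

```python
def converted_playback_time(frame_rates, target_rate, frame_sec):
    """Total playback time in seconds after speech-speed conversion.

    frame_rates: per-frame speech rates in mora/second (from the speech rate storage unit).
    target_rate: speech rate requested by the user in mora/second.
    frame_sec:   duration of one frame before conversion (assumed constant).
    """
    # Each frame is stretched or shrunk by measured / target.
    return sum(frame_sec * (rate / target_rate) for rate in frame_rates)

total_sec = converted_playback_time([10.0, 8.0, 6.0], target_rate=8.0, frame_sec=2.0)
```

Because only the stored per-frame rates are needed, the total can be shown before any audio is actually converted or played.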

As described above, with the audio speed conversion device 3 according to this embodiment, once the user enters the desired speech rate of the playback speech, the playback time of the entire audio signal can be known in advance without actually playing back the audio or images of the audio signal.

(Third Embodiment)
Next, a third embodiment of the present invention will be described with reference to FIG. 16. FIG. 16 is a block diagram showing an audio speed conversion device 4 according to the third embodiment of the present invention. In FIG. 16, the audio speed conversion device 4 includes an audio signal storage unit 11, a pause detection unit 12, a statistical data storage unit 13, a speech rate calculation unit 14, a speech rate control unit 15, a speed conversion unit 16, a speaker 17, a speech rate storage unit 31, a speed input unit 42, a playback speed calculation unit 43, and a display unit 44. The audio signal storage unit 11, pause detection unit 12, statistical data storage unit 13, speech rate calculation unit 14, speech rate control unit 15, speed conversion unit 16, speaker 17, and speech rate storage unit 31 have the same functions as in the audio speed conversion devices 1 and 3 described in the first and second embodiments, so they are given the same reference numerals and their detailed description is omitted.

First, as described in the first embodiment, the pause detection unit 12 detects pauses in the audio signal stored in the audio signal storage unit 11, and the speech rate calculation unit 14 calculates the speech rate of each frame from the pause lengths of those pauses and the statistical data stored in the statistical data storage unit 13. Next, the speed input unit 42 displays on a screen, with an indicator or the like, the playback time that the user wants for the entire audio signal after speech-speed conversion. The user then selects or enters the playback time of the entire audio signal with the input device described above.

The speech rate control unit 15 calculates the compression/expansion ratio by a predetermined control method from the post-conversion playback time of the entire audio signal entered at the speed input unit 42 and the playback time of the entire audio signal before conversion, using the speech rate of each frame stored in the speech rate storage unit 31. As the predetermined control method, for example, a method is set that brings the playback time of the entire audio signal from its pre-conversion value toward the playback time entered at the speed input unit 42.

The playback speed calculation unit 43 calculates the speech rate of the playback speech after speech-speed conversion from the compression/expansion ratios that the speech rate control unit 15 calculated for all of the audio signal data, and displays it on the display unit 44.
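
The third embodiment runs the same calculation in the opposite direction. As a rough sketch, assume equal-length frames and a control method that converts every frame to one common speech rate; the rate to display then follows from the total mora count and the requested playback time. These assumptions and the function name are illustrative only.

```python
def speech_rate_for_playback_time(frame_rates, frame_sec, desired_total_sec):
    """Common post-conversion speech rate (mora/second) that fits the requested time.

    frame_rates:       per-frame speech rates in mora/second before conversion.
    frame_sec:         duration of one frame before conversion (assumed constant).
    desired_total_sec: playback time of the whole signal requested by the user.
    """
    if desired_total_sec <= 0:
        raise ValueError("playback time must be positive")
    total_mora = sum(rate * frame_sec for rate in frame_rates)  # estimated mora in the signal
    return total_mora / desired_total_sec

shown_rate = speech_rate_for_playback_time([10.0, 8.0, 6.0], frame_sec=2.0, desired_total_sec=4.0)
```

Here 6 seconds of input squeezed into 4 seconds would be reported as 12 mora/second, which lets the user judge whether the requested time is still comfortable to listen to.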

As described above, with the audio speed conversion device 4 according to this embodiment, once the user enters the desired playback time of the entire audio signal, the speech rate of the playback speech after speech-speed conversion can be known in advance without actually playing back the audio or images of the audio signal.

With the audio speed conversion devices described in the second and third embodiments, the user can see information about the result of speech-speed conversion (such as the speech rate of the playback speech and the playback time of the entire audio signal) in advance, without actually playing back the audio or images of the audio signal, so speech-speed conversion that better matches the user's intention can be provided. Because this information is displayed, the user can also carry out the speech-speed conversion that meets his or her needs more intuitively.

The audio speed conversion device, audio speed conversion method, and audio speed conversion program according to the present invention can accurately calculate the speech rate of a speaker's voice even when another signal is superimposed on that voice, and are also applicable to uses such as audio playback devices, speech recognition devices, and speech summarization devices that perform speech-speed conversion.

FIG. 1 is a diagram showing an example of the occurrence frequency of pauses by pause length for each speech rate category (fast, normal, slow).
FIG. 2 is a schematic diagram showing an example of pauses in the time-axis waveform of speech.
FIG. 3 is a diagram showing an example of the occurrence frequency of within-sentence pauses for each speech rate category (fast, normal, slow).
FIG. 4 is a diagram showing the pause length of within-sentence pauses in mora units.
FIG. 5 is a diagram showing an example of the occurrence frequency of between-sentence pauses for each speech rate category (fast, normal, slow).
FIG. 6 is a block diagram showing the audio speed conversion device 1 according to the first embodiment of the present invention.
FIG. 7 is a diagram showing an example of the time-axis waveform of one frame of the audio signal.
FIG. 8 is a diagram showing an example of the statistical data on pause length for each class.
FIG. 9 is a block diagram showing the configuration of the speech rate calculation unit 14.
FIG. 10 is a block diagram showing the configuration of the pause classification unit 142.
FIG. 11 is a flowchart showing the flow of processing up to the calculation of the speech rate based on pause length.
FIG. 12 is a diagram showing an example of the calculated degree of fit L for each pause in a pause sequence.
FIG. 13 is a diagram showing the results of steps S3 to S6.
FIG. 14 is a block diagram showing a configuration in which the audio speed conversion device 1 is realized by the computer system 2.
FIG. 15 is a block diagram showing the audio speed conversion device 3 according to the second embodiment of the present invention.
FIG. 16 is a block diagram showing the audio speed conversion device 4 according to the third embodiment of the present invention.
FIG. 17 is a block diagram showing a configuration in which a conventional audio compression/expansion device is applied to an IC recorder.

Explanation of symbols

1, 3, 4 Audio speed conversion device
11 Audio signal storage unit
12 Pause detection unit
13 Statistical data storage unit
14 Speech rate calculation unit
15 Speech rate control unit
16 Speed conversion unit
17 Speaker
141 Pause length measurement unit
142 Pause classification unit
143 Pause frequency calculation unit
144 Speech rate conversion unit
145 Class identification unit
146 Pause sequence determination unit
147 Class determination unit
2 Computer system
21 CPU
22 Storage unit
23 Disk drive device
24 Recording medium
31 Speech rate storage unit
32 Speed input unit
33 Playback time calculation unit
34 Display unit
42 Speed input unit
43 Playback speed calculation unit
44 Display unit

Claims

An audio speed conversion device that converts the speech speed of an audio signal containing the voice of a speaker subject to speech-speed conversion and plays back the audio signal, the device comprising:
a non-speech section detection unit that distinguishes, in the audio signal, speech sections that contain the speaker's voice from non-speech sections that do not, and detects the non-speech sections;
a non-speech section length measurement unit that measures the time length of each non-speech section detected from the audio signal by the non-speech section detection unit;
a speech rate calculation unit that calculates the speech rate of the speaker in the audio signal based on the time length of each non-speech section measured by the non-speech section length measurement unit; and
a speech-speed conversion and playback unit that converts the speech speed of the audio signal according to the speech rate calculated by the speech rate calculation unit and plays back the audio signal.

The audio speed conversion device according to claim 1, wherein the speech rate calculation unit includes an occurrence frequency calculation unit that calculates the occurrence frequency of each time length of the non-speech sections measured by the non-speech section length measurement unit, and
the speech rate calculation unit calculates the speech rate of the speaker from the time length with the highest occurrence frequency, based on a statistical relational expression, obtained in advance, between speech rate and the time length of non-speech sections.

The audio speed conversion device according to claim 2, further comprising a statistical data storage unit that stores statistical data on the time length of non-speech sections, obtained statistically for each of a plurality of preset categories, wherein
the speech rate calculation unit includes a non-speech section classification unit that, based on the statistical data stored in the statistical data storage unit, classifies each non-speech section into one of the categories according to its time length measured by the non-speech section length measurement unit, and selects one category from the plurality of categories based on a predetermined condition, and
the occurrence frequency calculation unit calculates the occurrence frequency using the time lengths of the non-speech sections that belong to the category selected by the non-speech section classification unit.

The audio speed conversion device according to claim 3, wherein the statistical data storage unit defines the plurality of categories by dividing speech rate into a plurality of categories and stores the statistical data for each category, and
the non-speech section classification unit classifies each non-speech section into one of the categories according to its time length measured by the non-speech section length measurement unit, based on the statistical data for each speech rate category, and selects the category into which the most non-speech sections are classified.

The audio speed conversion device according to claim 3, wherein the statistical data storage unit stores statistical data for a first category, obtained statistically from the time lengths of non-speech sections that occur immediately after a comma, and statistical data for a second category, obtained statistically from the time lengths of non-speech sections that occur immediately after a period,
the non-speech section classification unit classifies each non-speech section into one of the categories according to its time length measured by the non-speech section length measurement unit, based on the statistical data for the first category and the second category, and selects the first category, and
the occurrence frequency calculation unit calculates the occurrence frequency using the time lengths of the non-speech sections that belong to the first category.

The audio speed conversion device according to claim 3, wherein the statistical data storage unit defines the plurality of categories as combinations of a plurality of speech rate categories with a first category, which occurs immediately after a comma, and a second category, which occurs immediately after a period, and stores statistical data on the time length of non-speech sections obtained statistically for each combination, and
the non-speech section classification unit classifies each non-speech section into one of the categories according to its time length measured by the non-speech section length measurement unit, based on the statistical data for each of the plurality of categories, determines the speech rate category by extracting, from the combinations of the speech rate categories with the second category, the speech rate category into which the most non-speech sections are classified, and selects the category that combines the determined speech rate category with the first category.

The audio speed conversion device according to claim 3, wherein the statistical data storage unit sets the plurality of categories in advance according to speaker characteristics and stores, for each of the plurality of categories, statistically obtained statistical data on the time length of the non-speech section.

The audio speed conversion device according to claim 1, wherein the speech speed calculation unit includes:
A non-speech section classification unit that classifies the non-speech sections detected by the non-speech section detection unit into a plurality of groups using the time lengths measured by the non-speech section length measurement unit, and selects one of the plurality of groups;
An occurrence frequency calculation unit that calculates an occurrence frequency using the time lengths of the non-speech sections belonging to the group selected by the non-speech section classification unit; and
A speech speed conversion unit that calculates the speech speed of the speaker from the time length having the maximum frequency in the occurrence frequency, based on a statistical relational expression, obtained in advance, between speech speed and the time length of the non-speech section.
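
As an illustration of the last element, the sketch below takes the most frequent pause length from a histogram and maps it to a speech speed. The claim does not give the form of the statistical relational expression. A linear relation is assumed here purely for illustration, and the coefficients `INTERCEPT` and `SLOPE`, the unit (morae per second), and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical coefficients of an assumed linear relation:
# speech_speed = INTERCEPT + SLOPE * pause_length  (morae per second, seconds)
INTERCEPT = 10.0
SLOPE = -8.0


def modal_pause_length(pause_lengths, bin_width=0.05):
    """Center of the histogram bin with the highest occurrence frequency."""
    if not pause_lengths:
        raise ValueError("at least one pause length is required")
    bins = Counter(int(p / bin_width) for p in pause_lengths)
    # Ties are resolved in favor of the shorter pause length.
    top_bin = max(bins, key=lambda b: (bins[b], -b))
    return (top_bin + 0.5) * bin_width


def estimate_speech_speed(pause_lengths, bin_width=0.05):
    """Apply the assumed relational expression to the modal pause length."""
    return INTERCEPT + SLOPE * modal_pause_length(pause_lengths, bin_width)
```

Using the mode rather than the mean keeps a few unusually long pauses from dominating the estimate.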

The audio speed conversion device according to claim 1, further comprising a display unit,
wherein the speech speed conversion/playback unit further includes a playback time calculation unit that calculates the playback time required to play back the audio signal after speech speed conversion and displays information indicating the playback time on the display unit.

The audio speed conversion device according to claim 1, further comprising a display unit,
wherein the speech speed conversion/playback unit further includes a playback speed calculation unit that calculates the playback speed at which the audio signal is played back after speech speed conversion and displays information indicating the playback speed on the display unit.
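
The playback time and playback speed calculations in the two display-unit claims can be sketched as simple arithmetic. The sketch assumes that the playback speed is the ratio of a user-desired speech speed to the estimated speech speed of the speaker, and that the playback time is the original duration divided by that ratio. These definitions, the display format, and the function names are assumptions made for illustration.

```python
def playback_speed_ratio(estimated_speed, target_speed):
    """Playback speed relative to real time (assumed: target / estimated)."""
    if estimated_speed <= 0 or target_speed <= 0:
        raise ValueError("speeds must be positive")
    return target_speed / estimated_speed


def playback_time(duration_s, ratio):
    """Time needed to play the converted signal, in seconds."""
    return duration_s / ratio


def format_display(duration_s, estimated_speed, target_speed):
    """Text for the display unit: playback speed and playback time (m:ss)."""
    ratio = playback_speed_ratio(estimated_speed, target_speed)
    total = round(playback_time(duration_s, ratio))
    minutes, seconds = divmod(total, 60)
    return f"{ratio:.2f}x, {minutes:d}:{seconds:02d}"
```

For example, a 600-second recording with an estimated speech speed of 8.0 and a desired speech speed of 6.0 gives `format_display(600, 8.0, 6.0) == "0.75x, 13:20"`.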

An audio speed conversion method for performing speech speed conversion on, and reproducing, an audio signal that includes the voice of a speaker subject to speech speed conversion, the method comprising:
A non-speech section detection step of distinguishing, in the audio signal, speech sections that include the voice of the speaker from non-speech sections that do not, and detecting the non-speech sections;
A non-speech section length measurement step of measuring the time length of each non-speech section detected by the non-speech section detection step from a predetermined duration of the audio signal;
A speech speed calculation step of calculating the speech speed of the speaker in the audio signal based on the time length of each non-speech section measured in the non-speech section length measurement step; and
A speech speed conversion/reproduction step of reproducing the audio signal after performing speech speed conversion according to the speech speed calculated in the speech speed calculation step.
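
The first three steps of the method can be outlined in a short sketch. The claim does not specify how non-speech sections are detected. A frame-energy threshold is assumed here, and every parameter (`frame_ms`, `threshold`, `min_pause_ms`) and function name is hypothetical. The speech speed estimator is passed in as a callable, and the time-scale modification that performs the actual conversion is outside the scope of the sketch.

```python
def detect_non_speech(samples, rate, frame_ms=10, threshold=0.01, min_pause_ms=50):
    """Return (start_s, end_s) spans whose mean absolute amplitude per frame
    stays below the threshold (an assumed detection criterion)."""
    frame = max(1, rate * frame_ms // 1000)
    n_frames = len(samples) // frame
    spans, start = [], None
    for i in range(n_frames):
        chunk = samples[i * frame:(i + 1) * frame]
        quiet = sum(abs(s) for s in chunk) / frame < threshold
        if quiet and start is None:
            start = i                      # a quiet run begins
        elif not quiet and start is not None:
            spans.append((start, i))       # the quiet run ends
            start = None
    if start is not None:
        spans.append((start, n_frames))    # signal ended during a quiet run
    sec = frame / rate
    return [(a * sec, b * sec) for a, b in spans
            if (b - a) * sec * 1000 >= min_pause_ms]


def measure_lengths(spans):
    """Time length of each detected non-speech section, in seconds."""
    return [end - start for start, end in spans]


def conversion_ratio(pause_lengths, target_speed, estimate_speed):
    """Playback ratio needed to reach the target speech speed, given an
    estimator that maps pause lengths to a speech speed."""
    return target_speed / estimate_speed(pause_lengths)
```

The resulting ratio would then drive whatever time-scale modification the reproduction step uses.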

An audio speed conversion program executed by a computer of an audio speed conversion device that performs speech speed conversion on, and reproduces, an audio signal that includes the voice of a speaker subject to speech speed conversion,
The program causing the computer to execute:
A non-speech section detection step of distinguishing, in the audio signal, speech sections that include the voice of the speaker from non-speech sections that do not, and detecting the non-speech sections;
A non-speech section length measurement step of measuring the time length of each non-speech section detected by the non-speech section detection step from a predetermined duration of the audio signal;
A speech speed calculation step of calculating the speech speed of the speaker in the audio signal based on the time length of each non-speech section measured in the non-speech section length measurement step; and
A speech speed conversion/reproduction step of reproducing the audio signal after performing speech speed conversion according to the speech speed calculated in the speech speed calculation step.