JP2956936B2

JP2956936B2 - Speech rate control circuit of speech synthesizer

Info

Publication number: JP2956936B2
Application number: JP63091131A
Authority: JP
Inventors: 義典北原; 洋一東倉
Original assignee: EI TEI AARU SHICHOKAKU KIKO KENKYUSHO KK
Current assignee: EI TEI AARU SHICHOKAKU KIKO KENKYUSHO KK
Priority date: 1988-04-13
Filing date: 1988-04-13
Publication date: 1999-10-04
Anticipated expiration: 2014-10-04
Also published as: JPH01262598A

Description

[Detailed Description of the Invention]

[Industrial Field of Application]

The present invention relates to a speech rate control circuit for a speech synthesizer, and in particular to a speech rate control circuit that outputs synthesized speech that sounds natural and is easy for the listener to follow even when the speech rate of the synthesizer is changed.

[Prior Art and Problems to Be Solved by the Invention]

Conventional speech synthesizers, as described for example in Japanese Unexamined Patent Publication No. 50-153807, control the speech rate of synthesized speech phoneme by phoneme: a small expansion/compression ratio is applied to short plosive consonants such as /p/, /t/, and /k/, and a large ratio is applied to stable vowels. However, merely varying the ratio from phoneme to phoneme does not take sufficient account of preserving fine phonetic detail, so the synthesized speech still sounds unnatural to the listener.

Therefore, the main object of the present invention is to provide a speech rate control circuit that can generate synthesized speech that is natural and easy for the listener to follow even when the speech rate of a speech synthesizer is changed.

[Means for Solving the Problems]

The invention according to claim 1 concerns a speech synthesizer comprising input means for inputting speech, A/D conversion means for converting the input speech into a digital signal, analysis means for analyzing the converted digital speech signal and extracting feature parameters, synthesized speech conversion means for converting the extracted feature parameters into synthesized speech, and D/A conversion means for converting the synthesized speech into an analog signal and outputting it. The synthesizer is further provided with spectrum analysis means for converting the digital speech signal into spectral parameters that carry the speech information, spectral change analysis means for obtaining a temporal rate of change from the spectral parameters, and speech rate control means for determining the synthesis frame period of the feature parameters on the basis of that rate of change and thereby controlling the speech rate. The synthesized speech conversion means expands or compresses the synthesized speech signal in time according to the determined synthesis frame period.

The invention according to claim 2 concerns a speech synthesizer comprising input means for inputting a character string or symbol string, reading string conversion means for converting the input character string or symbol string into a reading string, synthesized speech conversion means for converting the reading string into synthesized speech, and D/A conversion means for converting the synthesized speech into an analog signal. The synthesizer is further provided with spectrum analysis means for converting the reading string into spectral parameters that carry the speech information, spectral change analysis means for obtaining a temporal rate of change from the spectral parameters, and speech rate control means for determining the synthesis frame period of the feature parameters on the basis of that rate of change and thereby controlling the speech rate. The synthesized speech conversion means expands or compresses the synthesized speech signal in time according to the determined synthesis frame period.

[Operation]

The speech rate control circuit according to the present invention converts the speech signal, or the converted reading string, into spectral parameters; obtains a temporal rate of change from those parameters; determines the synthesis frame period of the feature parameters on the basis of that rate of change to control the speech rate; and expands or compresses the synthesized speech signal in time according to the determined synthesis frame period before outputting it.

[Embodiments of the Invention]

FIG. 1 is a schematic block diagram of one embodiment of the present invention.

In FIG. 1, spoken speech is input to a speech input unit 10. The input speech is passed to an A/D conversion unit 1, which converts it into a digital signal using a sampling signal of a predetermined interval. The digitized speech is supplied to a spectrum analysis unit 3 and converted into spectral parameters that carry phonetic information. Examples of such spectral parameters include the PARCOR coefficients described in Saito and Nakata, "Fundamentals of Speech Information Processing" (published by Ohmsha). PARCOR coefficients are only one example of spectral parameters, and the invention is not limited to them.
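As an illustration of what the spectrum analysis unit 3 might compute, the following is a minimal sketch of PARCOR (reflection) coefficient extraction for one frame using the autocorrelation method and the Levinson-Durbin recursion. The patent does not specify the analysis procedure; the function names, the sign convention, and the handling of silent frames are assumptions made for this example.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame (list of floats)."""
    n = len(frame)
    return [
        sum(frame[t] * frame[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def parcor_coefficients(frame, order):
    """Return reflection (PARCOR) coefficients k_1..k_order for one frame.

    Assumed sign convention: the predictor is x[t] ~ sum_j a[j] * x[t - j],
    and k_m is the last predictor coefficient at recursion step m.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)      # a[1..m] hold the current predictor coefficients
    err = r[0]                   # prediction error energy
    ks = []
    for m in range(1, order + 1):
        if err <= 0.0:           # silent or perfectly predictable frame (assumed handling)
            ks.append(0.0)
            continue
        k = (r[m] - sum(a[j] * r[m - j] for j in range(1, m))) / err
        ks.append(k)
        prev = a[:]
        a[m] = k
        for j in range(1, m):
            a[j] = prev[j] - k * prev[m - j]
        err *= 1.0 - k * k
    return ks
```

For a decaying exponential frame such as `[0.5 ** t for t in range(64)]`, the first coefficient comes out at about 0.5 and the second at about 0, which matches a first-order resonance.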

The digital speech signal output from the A/D conversion unit 1 is also input to a sound source analysis unit 2, which extracts sound source parameters such as the pitch period and power from it. The sound source parameters extracted by the sound source analysis unit 2 and the spectral parameters obtained by the spectrum analysis unit 3 are supplied to a synthesis unit 5, which uses them to generate synthesized speech.

Specifically, the synthesis unit 5 repeats the spectral parameters within a unit frame at intervals of the pitch period described above, converting them into a sequence of speech parameters. The synthesis unit 5 synthesizes the speech waveform from the spectral parameters and sound source parameters using, for example, a two-multiplier lattice speech synthesis filter as described in J. D. Markel and A. H. Gray Jr., "Linear Prediction of Speech" (Japanese translation by Suzuki, published by Corona). The two-multiplier lattice filter is only an example, and other speech synthesis means may be used. The synthesized speech waveform is output as speech through a D/A conversion unit 7.
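The lattice synthesis step can be sketched as follows. This is a generic all-pole lattice filter driven by an excitation sequence, with two multiplications per stage; it is not taken from the patent or from the cited book. The name `lattice_synthesize` and the sign convention (a first-order filter gives `y[n] = e[n] + k1 * y[n-1]`) are assumptions for this example.

```python
def lattice_synthesize(excitation, ks):
    """All-pole lattice synthesis filter.

    excitation: list of input samples (e.g. a pulse train at the pitch period).
    ks: reflection coefficients k_1..k_p, assumed convention
        f_{m-1}[n] = f_m[n] + k_m * b_{m-1}[n-1]
        b_m[n]     = b_{m-1}[n-1] - k_m * f_{m-1}[n]
    """
    p = len(ks)
    b = [0.0] * (p + 1)          # b[m] holds the delayed backward error b_m[n-1]
    out = []
    for e in excitation:
        f = e                    # forward error at the top stage
        for m in range(p, 0, -1):
            k = ks[m - 1]
            f = f + k * b[m - 1]         # first multiplication of the stage
            b[m] = b[m - 1] - k * f      # second multiplication of the stage
        b[0] = f                 # the output sample feeds the lowest delay
        out.append(f)
    return out
```

In a frame-based synthesizer the coefficient list `ks` would be swapped at each synthesis frame boundary, so lengthening or shortening the frame period changes how long each coefficient set stays active.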

Next, control of the speech rate of the synthesized speech is described. The speech rate is controlled by changing the synthesis frame period of the sound source pulses and spectral parameters when the synthesis unit 5 synthesizes the speech waveform from the sound source parameters and spectral parameters. This control is performed by a spectral change analysis unit 4 and a speech rate control unit 6. First, the spectral change analysis unit 4 obtains a temporal rate of change Δ from the spectral parameters calculated by the spectrum analysis unit 3. For example, the dynamic measure based on LPC cepstrum regression coefficients described in Sagayama and Itakura, "Speaker individuality information contained in dynamic measures of speech," Proceedings of the Acoustical Society of Japan (June 1979), can be used as the spectral change rate Δ. The measure of spectral change is of course not limited to this one.
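One way to picture such a dynamic measure is sketched below: for each frame, a least-squares slope (regression coefficient) is fitted to each cepstral coefficient over a short window of neighboring frames, and the slopes are combined into a single value per frame. The exact definition in the cited paper is not reproduced in this document, so the window length, the edge clamping, and the use of the Euclidean norm are assumptions.

```python
def spectral_change_rate(cepstra, half_window=2):
    """Per-frame spectral change rate from cepstrum regression coefficients.

    cepstra: list of frames, each a list of cepstral coefficients.
    half_window: number of frames on each side used for the slope fit (assumed).
    """
    n = len(cepstra)
    taus = range(-half_window, half_window + 1)
    denom = sum(tau * tau for tau in taus)
    rates = []
    for t in range(n):
        total = 0.0
        for k in range(len(cepstra[t])):
            slope = 0.0
            for tau in taus:
                idx = min(max(t + tau, 0), n - 1)   # clamp at the utterance edges
                slope += tau * cepstra[idx][k]
            slope /= denom                          # regression coefficient of c_k
            total += slope * slope
        rates.append(total ** 0.5)                  # Euclidean norm (assumed)
    return rates
```

A steady vowel gives values near zero, while a rapid transition such as a plosive release gives large values.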

FIG. 2(a) shows an example of how the spectral change rate Δ of speech varies over time, and FIG. 2(b) shows the average of Δ for each frame. FIG. 2(c) shows an example of the sound source pulses and spectral parameters when the speech is produced at the same rate as the original speech; the sound sources of the i-th and j-th frames are shown. The spectral parameters of these frames are represented by the vectors $\{P^i_1, P^i_2, \ldots, P^i_m\}$ and $\{P^j_1, P^j_2, \ldots, P^j_m\}$, respectively.

Next, the case of lowering the speech rate of the synthesized speech, that is, stretching the speech along the time axis, is described. In this embodiment, the synthesis frame period of the feature parameters is determined from the spectral change rate Δ obtained for each frame, as shown in FIG. 2(b). Given the spectral change rate $\Delta_i$ of the i-th frame in FIG. 2(b) and the synthesis frame period $L_i$ of the feature parameters of the original speech in the i-th frame in FIG. 2(c), a new synthesis frame period $LN_i$ is determined by the following equation (1).

$$LN_i = L_i \times \frac{\Delta_{\max}}{\Delta_i} \qquad (1)$$

The expression that defines $\Delta_{\max}$ at this point is missing from this text; the notation suggests the maximum of $\Delta_i$ over the frames. In this way an inverse correlation is established between the magnitude of the spectral change rate Δ and the expansion ratio, and the sound source pulses and the spectral parameters $\{P^i_1, P^i_2, \ldots, P^i_m\}$ are newly arranged as shown in FIG. 2(d).
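A minimal sketch of equation (1) follows. It assumes that Δmax is the largest change rate among the frames being processed, which the text does not state explicitly, and it adds a small floor on Δ to avoid division by zero; the floor and the function name are not part of the patent.

```python
def stretched_frame_periods(periods, change_rates, floor=1e-6):
    """New synthesis frame periods for slower speech, after equation (1).

    periods: original synthesis frame periods L_i.
    change_rates: per-frame spectral change rates Delta_i.
    floor: lower bound applied to Delta_i (an assumption, to avoid dividing by zero).
    """
    rates = [max(d, floor) for d in change_rates]
    d_max = max(rates)           # assumed meaning of Delta_max
    return [length * d_max / d for length, d in zip(periods, rates)]
```

With periods `[10.0, 10.0, 10.0]` and change rates `[4.0, 2.0, 1.0]`, the frame with the fastest spectral change keeps its period of 10, while the steadier frames are stretched to 20 and 40.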

The above procedure also applies when the speech rate of the synthesized speech is raised. In that case equation (1) is replaced by

$$LN_i = L_i \times \frac{\Delta_i}{\Delta_{\max}} \qquad (2)$$

to determine the new synthesis frame period $LN_i$. The relationship between the spectral change rate and the expansion/compression ratio need not be an inverse correlation; any function may be used.
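Because the text allows an arbitrary function between the change rate and the expansion/compression ratio, the two equations can be viewed as two choices of a mapping applied to the normalized rate Δi/Δmax. The sketch below makes that explicit. The names `rescaled_frame_periods`, `speed_up`, and `slow_down`, the floor on Δ, and the reading of Δmax as the maximum over frames are all assumptions for illustration.

```python
def rescaled_frame_periods(periods, change_rates, mapping, floor=1e-6):
    """Scale each frame period L_i by mapping(Delta_i / Delta_max).

    mapping: function from the normalized change rate (0 < r <= 1) to a scale factor.
    floor: lower bound applied to Delta_i (an assumption, to keep r positive).
    """
    rates = [max(d, floor) for d in change_rates]
    d_max = max(rates)           # assumed meaning of Delta_max
    return [length * mapping(d / d_max) for length, d in zip(periods, rates)]


def speed_up(ratio):
    """Equation (2): LN_i = L_i * Delta_i / Delta_max."""
    return ratio


def slow_down(ratio):
    """Equation (1): LN_i = L_i * Delta_max / Delta_i."""
    return 1.0 / ratio
```

For example, `rescaled_frame_periods([10.0] * 3, [4.0, 2.0, 1.0], speed_up)` shortens the steadier frames to 5 and 2.5 while leaving the fastest-changing frame at 10. A gentler mapping such as `lambda r: r ** 0.5` could be passed in the same way.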

FIG. 3 is a schematic block diagram showing another embodiment of the present invention.

In FIG. 3, sentences or words expressed as character strings or symbol strings are input to an input terminal 20 from an input means (not shown) such as an OCR device or a keyboard. A morphological analysis unit 8 converts the input sentences or words into a sequence of morphemes on the basis of the contents stored in a morpheme dictionary unit 9. The morpheme dictionary unit 9 stores at least a reading and a part of speech for each entry, and the input character string or symbol string is divided into morphemes by a method such as the longest-match method described in Aizawa and Ehara, "Kana-Kanji Conversion by Computer," NHK Technical Research, 25-5. The longest-match method is only one example of a means of morpheme segmentation, and the invention is not limited to it.
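The longest-match idea can be sketched as a greedy left-to-right scan that always takes the longest dictionary entry starting at the current position. This is a generic illustration, not the procedure of the cited paper; the dictionary layout (surface form mapped to a reading and a part of speech) and the treatment of unknown characters are assumptions.

```python
def longest_match_segment(text, dictionary):
    """Greedy longest-match segmentation into (surface, reading, part_of_speech).

    dictionary: dict mapping a surface form to a (reading, part_of_speech) tuple.
    Characters not covered by any entry are emitted one at a time as "unknown".
    """
    max_len = max((len(word) for word in dictionary), default=1)
    result = []
    pos = 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in dictionary:
                reading, part_of_speech = dictionary[piece]
                result.append((piece, reading, part_of_speech))
                pos += length
                break
        else:
            # No entry starts here: pass the character through unchanged.
            result.append((text[pos], text[pos], "unknown"))
            pos += 1
    return result
```

With a toy dictionary containing 音声, 合成, 音声合成, and の, the input 音声合成の音声 is split into 音声合成 / の / 音声, because the longer compound entry is preferred over its parts.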

The morphemes produced by the morphological analysis unit 8 are supplied to a pitch control processing unit 10, which determines the pitch frequency components representing the rise and fall of the voice on the basis of the contents stored in an accent dictionary unit 11 and an accent combination rule unit 12. The morpheme sequence with pitch frequency components attached is supplied to a speech parameter generation unit 13 and converted into a sequence of phoneme parameters on the basis of the contents stored in a phoneme segment dictionary unit 14. The phoneme segment dictionary unit 14 holds, as parameters, phoneme segments, that is, the phonemes that make up sentences and words, or phoneme chains such as CV and VC. The segments are arranged in the order of the phonemes or CV/VC chains that make up each morpheme, and the parameters within a unit frame are repeated at intervals of the pitch frequency described above to form a sequence of speech parameters.

The spectrum analysis unit 15, spectral change analysis unit 16, synthesis unit 17, speech rate control unit 18, and D/A conversion unit 19 then operate in the same way as in the embodiment shown in FIG. 1.

[Effects of the Invention]

As described above, according to the present invention, when the speech rate of synthesized speech is changed, a temporal rate of change is obtained from the spectral parameters, the synthesis frame period of the feature parameters is determined on the basis of that rate of change to control the speech rate, and the synthesized speech signal is expanded or compressed in time according to the determined synthesis frame period. This prevents phonemes whose spectrum changes rapidly from being destroyed by time expansion or compression, so that natural, easy-to-follow speech is obtained.

[Brief description of the drawings]

FIG. 1 is a schematic block diagram of one embodiment of the present invention. FIG. 2(a) shows an example of the spectral change rate, and FIG. 2(b) shows the spectral change rate of FIG. 2(a) averaged over each frame. FIG. 2(c) shows the arrangement of sound source pulses and spectral parameters at the same rate as the original speech, and FIG. 2(d) shows the new arrangement of sound source pulses and spectral parameters determined from the spectral change rate shown in FIG. 2(b). FIG. 3 is a schematic block diagram of another embodiment of the present invention.

In the figures, 1 is an A/D conversion unit, 2 is a sound source analysis unit, 3 and 15 are spectrum analysis units, 4 and 16 are spectral change analysis units, 5 and 17 are synthesis units, 6 and 18 are speech rate control units, 7 and 19 are D/A conversion units, 8 is a morphological analysis unit, 9 is a morpheme dictionary unit, 10 is a pitch control processing unit, 11 is an accent dictionary unit, 12 is an accent combination rule unit, 13 is a speech parameter generation unit, and 14 is a phoneme segment dictionary unit.

Continuation of the front page

(72) Inventor: Yoichi Tohkura (東倉洋一), c/o ATR Auditory and Visual Perception Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto

(56) References: JP-A-62-245298 (JP, A); JP-A-63-64098 (JP, A)

Claims

(57) [Claims]

1. A speech rate control circuit for a speech synthesizer, the speech synthesizer comprising: input means for inputting speech; A/D conversion means for converting the speech input from the input means into a digital signal; analysis means for analyzing the digital speech signal converted by the A/D conversion means and extracting feature parameters; synthesized speech conversion means for converting the feature parameters extracted by the analysis means into synthesized speech; and D/A conversion means for converting the synthesized speech converted by the synthesized speech conversion means into an analog signal and outputting it, the speech rate control circuit comprising: spectrum analysis means for converting the digital speech signal converted by the A/D conversion means into spectral parameters that carry phonetic information; spectral change analysis means for obtaining a temporal rate of change using the spectral parameters converted by the spectrum analysis means; and speech rate control means for determining a synthesis frame period of the feature parameters on the basis of the temporal rate of change obtained by the spectral change analysis means and thereby controlling the speech rate, wherein the synthesized speech conversion means expands or compresses the synthesized speech signal in time on the basis of the synthesis frame period determined by the speech rate control means.

2. A speech rate control circuit for a speech synthesizer, the speech synthesizer comprising: input means for inputting a character string or symbol string; reading string conversion means for converting the character string or symbol string input by the input means into a reading string; synthesized speech conversion means for converting the reading string converted by the reading string conversion means into synthesized speech; and D/A conversion means for converting the synthesized speech converted by the synthesized speech conversion means into an analog signal, the speech rate control circuit comprising: spectrum analysis means for converting the reading string converted by the reading string conversion means into spectral parameters that carry phonetic information; spectral change analysis means for obtaining a temporal rate of change using the spectral parameters converted by the spectrum analysis means; and speech rate control means for determining a synthesis frame period of the feature parameters on the basis of the temporal rate of change obtained by the spectral change analysis means and thereby controlling the speech rate, wherein the synthesized speech conversion means expands or compresses the synthesized speech signal in time on the basis of the synthesis frame period determined by the speech rate control means.