JPH1195796A

JPH1195796A - Voice synthesizing method

Info

Publication number: JPH1195796A
Application number: JP9250857A
Authority: JP
Inventors: Masami Akamine; 政巳赤嶺; Takehiko Kagoshima; 岳彦籠嶋; Katsumi Tsuchiya; 勝美土谷
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1997-09-16
Filing date: 1997-09-16
Publication date: 1999-04-09

Abstract

PROBLEM TO BE SOLVED: To provide a voice synthesizing method by which synthesized voices of superior tone quality can be obtained, the size of the voice element dictionary is compact, and the voice quality is easily changed. SOLUTION: In an analysis section 100, the voice elements segmented by a pitch waveform segmenting section 101 are inputted into an LPC analysis section 102 and expressed in the form of residual signals and LPC coefficients. Sets of these spectrum parameters and residual signals are stored in a residual signal storage section 103 and an LPC coefficient storage section 104 as a voice element dictionary. In a synthesis section 200, a selecting section 201 selects a set of residual signals and spectrum parameters in accordance with the phoneme symbol string given by a sentence analysis/rhythm control section. Voice elements are then generated by passing the selected residual signals through a synthesis filter 202, which is constructed according to the selected spectrum parameters. Pitch period control by a pitch synchronization waveform superimposing method and duration length control are then conducted on the voice elements in a rhythm control section 203, and synthesized voices are generated by connecting these voice elements in an element connecting section 204.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a speech synthesis method suitable for text-to-speech synthesis, and more particularly to a speech synthesis method that generates synthesized speech from information such as a phoneme symbol string, pitch, and phoneme duration.

【０００２】[0002]

[Description of the Related Art] Artificially creating a speech signal from arbitrary text is called text-to-speech synthesis. Text-to-speech synthesis is generally performed in three stages: a language processing unit, a phoneme processing unit, and a speech synthesis unit. The input text first undergoes morphological analysis, syntactic analysis, and the like in the language processing unit. Accent and intonation are then processed in the phoneme processing unit, which outputs information such as a phoneme symbol string, pitch, and phoneme duration. Finally, the speech synthesis unit generates synthesized speech from this information.

[0003] A speech synthesis method used for such text-to-speech synthesis must be able to synthesize an arbitrary phoneme symbol string with an arbitrary prosody. Speech synthesis methods capable of synthesizing an arbitrary phoneme symbol string as speech are broadly classified into the LPC analysis-synthesis method and the waveform editing method.

[0004] The LPC analysis-synthesis method, as introduced for example in Reference (1), Ito and Sato: "Pitch control method in speech synthesis using segmented residuals", Proceedings of the Acoustical Society of Japan meeting, 2-7-18 (1989-3), applies LPC analysis to a speech signal to obtain LPC spectral parameters and a residual signal, and performs prosody control and connection at the level of the residual signal. This method has the advantages that the voice quality can easily be changed by manipulating the LPC coefficients and that the speech unit dictionary needed for synthesis is relatively small. On the other hand, the synthesized speech has a so-called nasal quality that lacks clarity, and its sound quality has been unsatisfactory.

[0005] The waveform editing method, on the other hand, is introduced for example in Reference (2), Hirokawa, Hakoda, and Sato: "Waveform selection method considering spectral continuity in waveform-editing synthesis", Proceedings of the Acoustical Society of Japan meeting, 2-6-10 (1990-9); Reference (3), Iwata et al.: "Software Japanese text-to-speech synthesis for personal computers", Proceedings of the Acoustical Society of Japan meeting, 2-8-13 (1993-10); and Reference (4), Koyama and Koizumi: "A study of waveform-based rule synthesis using VCV as the basic unit", IEICE Technical Report SP96-8 (1996-5). It synthesizes speech by modifying the pitch period and duration of speech units cut out from real speech waveforms and then connecting them. High sound quality is considered relatively easy to achieve with this method, and it has been studied actively.

[0006] Furthermore, from the standpoint that high sound quality is best achieved by avoiding signal processing such as analysis and synthesis, a method has also been proposed in which speech waveforms whose phonemic and prosodic environments match are taken from a natural-speech database and connected in the longest possible units (Reference (5), N. Campbell and A. W. Black: "CHATR: an arbitrary speech synthesis system based on natural speech waveform concatenation", IEICE Technical Report SP96-7 (1996-5)).

[0007] These methods have the advantage that they can generate synthesized speech of higher sound quality than the analysis-synthesis method, but they have the problem that the speech unit dictionary becomes large. In addition, because the spectral parameters are not represented explicitly, it is difficult to change the voice quality.

[0008] The present invention has been made to solve the conventional problems described above, and its object is to provide a speech synthesis method in which the synthesized speech has excellent sound quality, the speech unit dictionary is compact, and the voice quality is easy to change.

【０００９】[0009]

[Means for Solving the Problems] To solve the above problems, the speech synthesis method according to the present invention is characterized in that a speech unit is expressed in the form of a residual signal and a spectral parameter such as LPC coefficients, a speech unit is created by passing the residual signal through a synthesis filter configured according to the spectral parameter, prosody control is performed on this speech unit, and the speech units after prosody control are connected to generate synthesized speech.

[0010] More specifically, speech units are expressed in the form of a residual signal and a spectral parameter, and the sets of residual signal and spectral parameter are stored as a speech unit dictionary. A set of residual signal and spectral parameter is selected according to a given phoneme symbol string, a speech unit is created by passing the selected residual signal through a synthesis filter configured according to the selected spectral parameter, prosody control is performed on this speech unit, and the speech units after prosody control are connected to generate a synthesized speech signal.

[0011] In the prosody control, the pitch period is preferably controlled by applying the pitch-synchronous waveform overlap-add method to the speech units obtained from the synthesis filter. The duration of the speech units may additionally be controlled in the prosody control.

[0012] Whereas the conventional residual-driven speech synthesis method controls prosody at the level of the residual signal, the speech synthesis method according to the present invention controls prosody at the level of the speech unit and connects the speech units after prosody control. Synthesized speech with sound quality equivalent to that of the waveform editing method is therefore obtained.

[0013] In this case, if the pitch-synchronous waveform overlap-add method is used to control the pitch period in the prosody control, even clearer, higher-quality speech synthesis becomes possible. Moreover, because the speech units prepared as the speech unit dictionary are represented in the present invention by sets of a residual signal and a spectral parameter such as LPC coefficients, the speech unit dictionary is also compact.

[0014] Furthermore, because each speech unit is represented by a set of a spectral parameter and a residual signal, the voice quality of the synthesized speech can easily be changed by manipulating the spectral parameter.

【００１５】[0015]

[Embodiments of the Invention] Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 is a block diagram showing an embodiment in which the speech synthesis method according to the present invention is applied to a text-to-speech synthesis system. This speech synthesis system consists broadly of an analysis unit 100 and a synthesis unit 200.

[0016] The analysis unit 100 comprises a pitch waveform extraction unit 101 that cuts out pitch waveforms from the input speech waveform, an LPC analysis unit 102 that performs LPC analysis (linear predictive analysis) on the extracted pitch waveforms to extract a residual signal and LPC coefficients, which are the spectral parameters, and a residual signal storage unit 103 and an LPC coefficient storage unit 104 that store the sets of residual signal and LPC coefficients extracted by the LPC analysis unit 102 as a speech unit dictionary.

[0017] The synthesis unit 200, on the other hand, comprises: a speech unit selection unit 201 that, according to the phoneme symbol string obtained by a sentence analysis / prosody control unit (not shown) analyzing the text to be synthesized, selects and retrieves the set of residual signal and LPC coefficients corresponding to each phoneme symbol from the residual signal storage unit 103 and the LPC coefficient storage unit 104 of the analysis unit 100; a synthesis filter 202 that is configured according to the selected LPC coefficients and creates a speech unit with the selected residual signal as its input; a prosody control unit 203 that controls the prosody of the created speech unit according to the pitch period and duration information given by the sentence analysis / prosody control unit; and a unit connection unit 204 that connects the speech units after prosody control to generate synthesized speech.

[0018] Next, the detailed processing procedure of the analysis unit 100 will be described with reference to the flowchart shown in FIG. 2. First, a speech waveform is input to the analysis unit 100 (step S11). As this speech waveform, for example, representative speech units created as described later are used.

[0019] Next, the pitch waveform extraction unit 101 multiplies the input speech waveform by a window function of pitch-period length to cut out a waveform of one pitch period, and the LPC analysis unit 102 then performs pitch-synchronous LPC analysis (steps S12 to S13). In this case, because the window function smooths the discrete (harmonic) spectrum of the speech waveform, a spectral envelope in which the influence of the fundamental frequency is reduced can be obtained.

[0020] As a result of the LPC analysis in step S12, the speech unit is represented by sets of a residual signal and LPC coefficients, one set per pitch period. The residual signals are stored in the residual signal storage unit 103 and the LPC coefficients in the LPC coefficient storage unit 104, each associated with the other, as the speech unit dictionary (step S14).
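The analysis steps of paragraphs [0019] and [0020] can be illustrated roughly as follows. This is a minimal sketch, not the implementation of the embodiment: the Hann window, the autocorrelation method with the Levinson-Durbin recursion, the prediction order, and all function names are assumptions chosen for illustration, since the text specifies only "a window function of pitch-period length" and "LPC analysis".

```python
import math

def hann_window(length):
    """Symmetric Hann window (assumed; the text does not name the window)."""
    if length == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return a[0..order] with a[0] = 1 so that e[n] = sum_k a[k] * x[n-k]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:          # silent input: nothing left to predict
            break
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        k_i = -acc / err        # reflection coefficient
        new_a = a[:]
        for k in range(1, i):
            new_a[k] = a[k] + k_i * a[i - k]
        new_a[i] = k_i
        a = new_a
        err *= 1.0 - k_i * k_i
    return a

def inverse_filter(x, a):
    """Residual signal: output of the FIR prediction-error filter A(z)."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

def analyze_pitch_waveform(segment, order=4):
    """Window one extracted segment, then return (LPC coefficients, residual)."""
    windowed = [s * w for s, w in zip(segment, hann_window(len(segment)))]
    a = levinson_durbin(autocorrelation(windowed, order), order)
    return a, inverse_filter(windowed, a)
```

Each returned pair would correspond to one entry stored in the residual signal storage unit 103 and the LPC coefficient storage unit 104.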

[0021] Next, the detailed processing procedure of the synthesis unit 200 will be described with reference to the flowchart shown in FIG. 3. For speech synthesis, a phoneme symbol string together with pitch period and duration (phoneme duration) information is supplied from the sentence analysis / prosody control unit (not shown). First, according to the phoneme symbol string, the selection unit 201 selects and reads out the set of residual signal and LPC coefficients corresponding to each phoneme symbol from the residual signal storage unit 103 and the LPC coefficient storage unit 104, which constitute the speech unit dictionary (step S21).

[0022] Next, the synthesis filter 202 is configured with the LPC coefficients selected in step S21, and a speech unit is created by inputting the residual signal selected in step S21 to this synthesis filter 202 (steps S22 to S23).
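Steps S21 to S23 amount to a dictionary lookup followed by all-pole filtering. The sketch below assumes the sign convention A(z) = 1 + a[1]z^-1 + ... for the LPC coefficients and a dictionary keyed by unit symbol that holds one (residual, coefficients) pair per pitch period; the dictionary layout, the symbol "ka", and the function names are hypothetical and not taken from the specification.

```python
def synthesis_filter(residual, a):
    """All-pole filter 1/A(z): y[n] = e[n] - sum_{k>=1} a[k] * y[n-k]."""
    y = []
    for n, e in enumerate(residual):
        acc = e
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y

# Hypothetical speech unit dictionary: one (residual, LPC coefficients) pair
# per pitch period, as stored by units 103 and 104.
unit_dictionary = {
    "ka": [([1.0, 0.0, 0.0, 0.0], [1.0, -0.5])],
}

def make_speech_unit(symbol, dictionary):
    """Select the stored pairs for a symbol and regenerate its pitch waveforms."""
    return [synthesis_filter(res, lpc) for res, lpc in dictionary[symbol]]
```

With the toy entry above, an impulse residual driven through a one-pole filter yields the decaying sequence 1, 0.5, 0.25, 0.125. Changing the stored coefficients while keeping the residual is the kind of spectral-parameter manipulation that paragraph [0014] associates with changing voice quality.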

[0023] Next, the prosody control unit 203 performs prosody control, that is, control of the pitch period and control of the duration, on the speech unit created in step S23 according to the pitch period and duration information given by the sentence analysis / prosody control unit.

[0024] Specifically, the pitch period is first controlled by applying the pitch-synchronous waveform overlap-add method (PSOLA) to the speech unit created in step S23, in the same way as in the waveform editing method (step S24). The pitch-synchronous waveform overlap-add method is a known technique described, for example, in Reference (6), F. Charpentier and M. Stella: "Diphone Synthesis Using an Overlap-add Technique for Speech Waveforms Concatenation", Proc. ICASSP 86, pp. 2015-2018 (1986). In this embodiment, to enable speech synthesis of still higher sound quality, the pitch period is controlled on the basis of the pitch-synchronous waveform overlap-add method as follows.

[0025] In general, the sound quality of synthesized speech depends heavily on the smoothness of voiced sounds. In this embodiment, therefore, the given pitch period is interpolated sample by sample so that the pitch period changes more smoothly. Let the center times of the j-th frame and the (j+1)-th frame be t₁ and t₂, and their pitch periods be p₁ and p₂, respectively. If the pitch period changes linearly, the pitch period p(t) at time t is expressed by the following equation.

【００２６】[0026]

[Equation 1] (equation image not reproduced in this text)

Further, letting the positions of the pitch marks from t₁ to t₂ be m_k (k = 1, 2, ..., N), the following equation holds.

【００２７】[0027]

[Equation 2] (equation image not reproduced in this text)

From equations (1) and (2), the following equation is obtained.

【００２８】[0028]

[Equation 3] (equation image not reproduced in this text)
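Because equations (1) to (3) survive only as image placeholders, the following is a sketch under stated assumptions rather than a transcription of the patent's formulas. Linear interpolation between the frame centers gives p(t) = p₁ + (p₂ − p₁)(t − t₁)/(t₂ − t₁), which follows directly from the definitions in paragraph [0025]. The recursion used to place the pitch marks, m_{k+1} = m_k + p(m_k) starting at t₁, is an assumption made for illustration, as are the function names.

```python
def pitch_period(t, t1, p1, t2, p2):
    """Pitch period at time t, interpolated linearly between frame centers."""
    return p1 + (p2 - p1) * (t - t1) / (t2 - t1)

def pitch_marks(t1, p1, t2, p2):
    """Pitch marks in [t1, t2): each mark lies one local pitch period after
    the previous one (assumed recursion, see the lead-in)."""
    if p1 <= 0 or p2 <= 0 or t2 <= t1:
        raise ValueError("pitch periods must be positive and t2 must exceed t1")
    marks = []
    m = float(t1)
    while m < t2:
        marks.append(m)
        m += pitch_period(m, t1, p1, t2, p2)
    return marks
```

Times are in samples here; a practical implementation would round each mark to an integer sample index before the overlap-add step.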

[0029] In the pitch period control by the prosody control unit 203, the speech units created by the synthesis filter 202 are overlap-added with the pitch mark positions obtained in this way as the reference. That is, for example, the head of a speech unit is placed at each pitch mark position on the time axis, and the units are added onto a zero signal. Where adjacent speech units placed at neighboring pitch marks overlap, their samples are summed; where they do not overlap, the original speech unit is left unchanged.
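The overlap-add operation described in paragraph [0029] can be sketched as below. The function name is hypothetical, and the pitch marks are assumed to have been rounded to non-negative integer sample indices beforehand.

```python
def overlap_add(pitch_waveforms, marks):
    """Place the head of each waveform at its pitch mark and sum onto zeros."""
    if not pitch_waveforms:
        return []
    total = max(m + len(w) for w, m in zip(pitch_waveforms, marks))
    out = [0.0] * total                 # the "zero signal"
    for w, m in zip(pitch_waveforms, marks):
        for i, v in enumerate(w):
            out[m + i] += v             # overlapping samples are summed
    return out
```

For two three-sample waveforms of ones placed at marks 0 and 2, only sample 2 overlaps, so the result is 1, 1, 2, 1, 1. Moving the marks closer together shortens the pitch period (raises the pitch) without altering the waveforms themselves.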

[0030] The prosody control unit 203 further controls the duration (step S25). In this duration control, what matters is how the pitch marks of the original speech waveform and those of the synthesized speech waveform are associated with each other. In this embodiment, this association is made by a temporal mapping expressed as a function. With this method, by defining the mapping function appropriately, the thinning-out and interpolation of pitch waveforms can be controlled freely according to the nature of the speech units being connected.
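One way to picture the mapping-function approach of paragraph [0030] is shown below. The specification does not give the form of the mapping function, so the normalized-time formulation, the default linear mapping, and the names used here are all assumptions.

```python
import math

def map_pitch_waveforms(num_source, num_target, mapping=None):
    """For each target pitch mark, pick the index of a source pitch waveform.

    `mapping` takes normalized target time in [0, 1] and returns normalized
    source time in [0, 1]; the identity (linear mapping) is the default.
    """
    if mapping is None:
        mapping = lambda x: x
    indices = []
    for k in range(num_target):
        x = k / (num_target - 1) if num_target > 1 else 0.0
        src = math.floor(mapping(x) * (num_source - 1) + 0.5)  # nearest index
        indices.append(min(max(src, 0), num_source - 1))
    return indices
```

With three source waveforms and five target marks, waveforms are repeated (interpolation); with five source waveforms and three target marks, some are skipped (thinning). A nonlinear mapping could, for instance, concentrate the repetition in the steady part of a vowel and leave transitions untouched.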

[0031] Next, the speech units on which prosody control (control of the pitch period and duration) has been performed in this way are connected to one another (step S26). In this embodiment, to reduce the distortion caused by waveform discontinuity at the connection points, CV and VC units are used as the speech units, and the units are connected in the steady part of the vowel. At this time, the pitch waveforms of the vowels to be connected are joined by weighted addition over the entire vowel section. In this way, synthesized speech in which arbitrary text has been converted into a speech signal is generated.
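The weighted addition over the vowel section in paragraph [0031] can be sketched as a cross-fade between the vowel pitch waveforms of the preceding CV unit and those of the following VC unit. The linear weights, the requirement that both units supply the same number of equal-length pitch waveforms, and the function name are assumptions; the text states only that a weighted addition is used.

```python
def crossfade_vowel(left_waveforms, right_waveforms):
    """Blend vowel pitch waveforms of a CV unit (left) into a VC unit (right)."""
    n = len(left_waveforms)
    if n != len(right_waveforms):
        raise ValueError("both units must supply the same number of pitch waveforms")
    blended = []
    for k, (lw, rw) in enumerate(zip(left_waveforms, right_waveforms)):
        w = k / (n - 1) if n > 1 else 0.5   # weight of the right-hand unit
        blended.append([(1.0 - w) * a + w * b for a, b in zip(lw, rw)])
    return blended
```

The first blended pitch waveform equals that of the left unit and the last equals that of the right unit, so the transition is spread over the whole vowel instead of occurring at a single splice point.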

[0032] Next, a method of learning speech units that accompanies the present invention will be described. Conventionally, the creation of speech units has relied on manual trial and error: a skilled researcher had to spend a long time repeating a series of operations in which speech units were cut out from speech data of isolated-sound utterances, nonsense-word utterances, or continuous utterances, and the resulting synthesized speech was evaluated.

[0033] On the other hand, as a method of automatically generating speech units from a speech database, the phoneme environment clustering (COC: Context Oriented Clustering) method disclosed, for example, in Reference (7), Nakajima and Hamada: "Rule-based synthesis method using clustering based on phoneme environment", IEICE Transactions D-II, vol. J72-D-II, No. 8, pp. 1177-1179 (1989-8), is known. In this method, speech units cut out from a speech database are clustered on the basis of the variance of their spectral parameters under constraints on the phoneme environment, and the centroid of each cluster is used as a representative speech unit.

[0034] This phoneme environment clustering method has the feature that representative speech units can be determined on the basis of a statistical evaluation criterion without relying on a priori knowledge. However, because it does not take into account the distortion that accompanies pitch period control, which is a problem in speech synthesis, the sound quality of the synthesized speech is not necessarily satisfactory.

[0035] A learning method for representative speech units is therefore described in which the distortion of the synthesized speech is defined so as to include the distortion caused by prosody control (control of the pitch period and duration), and this distortion is minimized.

[0036] FIG. 4 shows a block diagram of the closed-loop learning system for representative speech units used in this embodiment. This learning method can in practice be applied to a variety of synthesizers and synthesis units, but here its application to learning the CV and VC speech units used in the speech synthesis system described above is explained. In this case, after the speech units have been generated by learning, the LPC coefficients of the synthesis filter and the residual signals are obtained from them.

[0037] In the learning, as advance preparation, a large number of speech units of the size of the speech synthesis unit are first cut out from the speech database 401 and used as representative speech unit candidates 402. At the same time, training data 403 for learning is created in the same way. Next, the pitch period and duration of the representative speech unit candidates are analyzed (404), and, with the training data 403 as the target, the pitch period and duration of each representative speech unit candidate are modified (405) to synthesize a speech unit. In this way, speech units are generated for all combinations of the representative speech unit candidates 402 and the training data.

[0038] Next, the distortion of each generated speech unit with respect to the training data is computed and evaluated (405), and the representative speech unit that minimizes the total distortion over all the training data is searched for and selected from the above representative speech unit candidates (406). This is used as the representative unit.

[0039] This learning method is called closed-loop learning in the sense that the evaluation results of the synthesized speech units are fed back into the learning of the speech units. A specific example of the distortion measure and the method of selecting representative speech units, which are the important elements of this learning method, is given below.

[0040] (Distortion measure) The distortion measure used for learning must reflect the results of subjective evaluation well. In addition, because the power of the synthesized speech is controlled by the speech synthesis system, the representative speech units must be evaluated at a level where their power has been normalized. Taking these points into consideration, the distortion of a synthesized speech unit is defined by the following equation.

【００４１】[0041]

[Equation 4] (equation image not reproduced in this text)
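Since the defining equation is not reproduced here, the following is only an illustration of what a power-normalized distortion can look like, not the measure defined in the specification. It assumes equal-length waveforms and uses the squared error that remains after scaling the synthesized unit by the gain that best matches the training unit; the function name and this particular form are assumptions.

```python
def gain_normalized_distortion(r, s):
    """Squared error between r and g*s, with g chosen to minimize the error."""
    if len(r) != len(s):
        raise ValueError("waveforms must have equal length")
    ss = sum(b * b for b in s)
    if ss == 0.0:
        return sum(a * a for a in r)    # silent s: error is the energy of r
    g = sum(a * b for a, b in zip(r, s)) / ss   # least-squares optimal gain
    return sum((a - g * b) ** 2 for a, b in zip(r, s))
```

Under this form, a synthesized unit that differs from the training unit only in level has zero distortion, which matches the requirement in paragraph [0040] that power differences should not influence the evaluation.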

[0042] Here, r_j denotes a training data item, and s_ij denotes the speech unit synthesized from representative speech unit candidate u_i with r_j as the target.

(Selection of representative speech units) Let n be the number of representative speech units per synthesis unit and N the number of representative speech unit candidates. Selecting the representative speech units then becomes the problem of searching, among the combinations of n units chosen from the N candidates, for the one set of representative speech units that minimizes the following cost function.

【００４３】[0043]

[Equation 5] (equation image not reproduced in this text)

[0044] Here, M is the number of training data items. Once the set of representative speech units that minimizes the cost function of equation (9) has been found, all the training data can be clustered into the clusters corresponding to the representative speech units.
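The selection and clustering described in paragraphs [0042] to [0044] can be sketched as an exhaustive search. Equation (9) is not reproduced in this text, so the cost used here, the average over the M training items of the smallest distortion achieved by any selected unit, is an assumption, as are the function name and the distortion values, which are invented for illustration only.

```python
from itertools import combinations

def select_representatives(distortion, n):
    """distortion[i][j]: distortion of candidate i synthesized toward item j.

    Returns (selected candidate indices, cost, clusters), where clusters maps
    each selected candidate to the training items it represents best.
    """
    num_candidates = len(distortion)
    num_training = len(distortion[0])
    best_set, best_cost = None, None
    for subset in combinations(range(num_candidates), n):
        cost = sum(min(distortion[i][j] for i in subset)
                   for j in range(num_training)) / num_training
        if best_cost is None or cost < best_cost:
            best_set, best_cost = subset, cost
    clusters = {i: [] for i in best_set}
    for j in range(num_training):
        nearest = min(best_set, key=lambda i: distortion[i][j])
        clusters[nearest].append(j)
    return best_set, best_cost, clusters

# Invented distortions for 4 candidates (rows) and 4 training items (columns).
example = [
    [5.0, 6.0, 7.0, 8.0],
    [1.0, 2.0, 9.0, 9.0],
    [9.0, 9.0, 1.0, 2.0],
    [4.0, 4.0, 4.0, 4.0],
]
selected, cost, clusters = select_representatives(example, 2)
```

For this toy table the second and third candidates (indices 1 and 2) are selected with a cost of 1.5, and the training items split into two clusters of two. Exhaustive search examines every n-element subset of the N candidates, so it is practical only when N and n are small.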

[0045] FIG. 5 shows an example in which two representative speech units are selected from four representative speech unit candidates. In this example, among all combinations of two units from u₁ to u₄, the combination of u₂ and u₃ gives the smallest value of the cost function. As a result, u₂ and u₃ are selected as the representative speech units.

[0046] (Evaluation experiment) With CV and VC diphones as the synthesis units, an experiment was performed in which one representative speech unit was created for each synthesis unit by the method described above. The speech unit data used for training and the representative speech unit candidates were cut out from a speech database to which phoneme labels had been attached by visual inspection, and a total of 302 CV and VC representative speech units were created by the closed-loop learning method described above. The learning took about 1.5 hours on a Sun Ultra2.

[0047] FIG. 6 shows the value of the cost function as the number of speech units per synthesis unit (CV, VC) is increased. The figure shows that the distortion of the synthesized speech decreases monotonically as the number of speech units increases.

[0048] It has long been known that the sound quality of synthesized speech improves when different speech units are used according to power and pitch. With the conventional trial-and-error method, however, creating representative speech units required a great deal of labor and time, and it was not easy to increase their number.

[0049] In contrast, with the closed-loop learning method described above, speech units can be created automatically in a short time once labeled speech data is given, and any number of representative speech units can easily be generated. Moreover, instead of selecting speech units on the basis of a priori knowledge such as power and pitch, selection rules can be created on the basis of the distortion measure of the synthesized speech. That is, speech unit selection rules can be generated by clustering the training data into the clusters of the selected representative speech units and extracting the factors common to each cluster.

[0050] Next, the sound quality of the synthesized speech obtained with the speech synthesis system described above was evaluated. The representative speech units that had been created were supplied to the analysis unit as the speech input of FIG. 1, decomposed into residual signals and LPC coefficients by the pitch waveform extraction unit 101 and the LPC analysis unit 102, and stored in that form in the residual signal storage unit 103 and the LPC coefficient storage unit 104 as the speech unit dictionary. For storage, the residual signals and LPC coefficients were encoded using vector-scalar quantization. As a result, the amount of data is about 150 kbytes per speaker, which is very compact at 1/10 to 1/20 of that of the waveform editing method. The speech synthesis system of this embodiment can therefore easily be incorporated into portable information terminals such as PDAs, car navigation systems, and the like.

[0051] A subjective evaluation on a seven-point scale (-3: very poor to +3: very good) was carried out by a total of 10 general subjects (equal numbers of men and women), including 7 university students. Compared with a speech synthesis system using the conventional cepstrum synthesis method, the sound quality of the synthesized speech obtained with the speech synthesis system of this embodiment improved by 2.5 points on average over male and female speakers and various sentences. The subjects judged that clarity had improved greatly and that the sound was softer and closer to a natural voice.

【００５２】[0052]

[Effects of the Invention] As described above, according to the speech synthesis method of the present invention, a speech unit is represented by a set of a residual signal and a spectral parameter such as LPC coefficients, and prosody control is performed on the speech unit generated from the residual signal and the spectral parameter. Clear, high-quality synthesized speech can therefore be generated, the voice quality can easily be changed by manipulating the spectral parameter, and the speech unit dictionary can also be made compact.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a speech synthesis system according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the processing procedure on the analysis side in the embodiment.

FIG. 3 is a flowchart showing the processing procedure on the synthesis side in the embodiment.

FIG. 4 is a block diagram for explaining the closed-loop learning system for representative speech units.

FIG. 5 is a diagram showing an example of representative speech unit selection based on the distortion of synthesized speech units.

FIG. 6 is a diagram showing the relationship between the number of representative speech units and the cost function.

[Explanation of symbols]

100: speech analysis unit
101: pitch waveform extraction unit
102: LPC analysis unit
103: residual signal storage unit
104: LPC coefficient storage unit
200: speech synthesis unit
201: selection unit
202: LPC synthesis filter
203: prosody control unit
204: speech unit connection unit

Claims

[Claims]

1. A speech synthesis method comprising: expressing a speech unit in the form of a residual signal and a spectral parameter; creating a speech unit by passing the residual signal through a synthesis filter configured according to the spectral parameter; performing prosody control on the speech unit; and connecting the speech units after the prosody control to generate synthesized speech.

2. A speech synthesis method comprising: expressing speech units in the form of a residual signal and a spectral parameter; storing sets of the residual signal and the spectral parameter as a speech unit dictionary; selecting a set of a residual signal and a spectral parameter in accordance with a given phoneme symbol string; creating a speech unit by passing the selected residual signal through a synthesis filter configured according to the selected spectral parameter; performing prosody control on the speech unit; and connecting the speech units after the prosody control to generate synthesized speech.

3. The speech synthesis method according to claim 1, wherein, in the prosody control, the pitch period is controlled by applying a pitch-synchronous waveform overlap-add method to the speech units obtained by the synthesis filter.

4. The speech synthesis method according to claim 3, wherein the prosody control further controls the duration of the speech unit.