JPH0247700A

JPH0247700A - Speech synthesizing method

Info

Publication number: JPH0247700A
Application number: JP63197851A
Authority: JP
Inventors: Tetsuo Umeda; 梅田　哲夫; Toru Tsugi; 徹都木
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1988-08-10
Filing date: 1988-08-10
Publication date: 1990-02-16
Anticipated expiration: 2014-06-14
Also published as: JP2904279B2

Abstract

PURPOSE: To preserve clarity and naturalness by changing the spectral envelope of a unit speech waveform according to first and second spectral change components. CONSTITUTION: A vocal cord frequency correction part 4 obtains the pitch frequency of each voiced section, and its trajectory over time, from the unit speech obtained by a unit speech selection part 2. A formant correction part 6 determines the formants of the voiced section from the shape (resonance frequencies) of the spectral envelope and finds their trajectories. A frequency characteristic changing part 8 then changes the spectral envelope of the waveform, which has been expanded or contracted according to the pitch frequency change, using the spectral change found by the formant correction part and the spectral change found by the vocal cord frequency correction part. Consequently, the phonemic quality and naturalness of each unit speech are maintained, and high-quality speech with prosodic information added is synthesized.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application] The present invention relates to a speech synthesis method that produces output by synthesizing pre-stored speech on the basis of output speech information.

Also known is (3) the rule synthesis method, which synthesizes speech from parameters stored for unit speech such as phonemes and syllables and from prosodic information such as accent and intonation generated by predetermined rules at synthesis time.

[Summary of the Invention] The present invention is a speech synthesis method that, when pre-stored speech is connected to output meaningful speech, takes into account the effects on the synthesized speech of both the connection and the added prosodic information, so that the clarity and naturalness of the synthesized speech are not impaired.

[Prior Art] Conventionally, techniques of this kind have included (1) the recording-editing method, which connects and plays back pre-recorded speech of words and phrases, and (2) the parameter editing method, which stores parameters obtained by analyzing the waveform and uses them to control a synthesizer during playback.

[Problems to Be Solved by the Invention] However, in methods (1) and (2), the output speech is selected from speech that a person has uttered and registered in advance. Sound quality is therefore maintained within each registered utterance, but because the utterances are connected and output as they are, discontinuities arise in the vocal cord vibration (pitch) frequency and in the formants, and the resulting speech is unnatural.

Method (3) outputs, at synthesis time, a sound source (an impulse train corresponding to the pitch frequency, white noise, or an estimated glottal waveform) passed through a vocal tract characteristic filter. Even in this case, changing the accent or intonation, that is, changing the pitch period of the speech, also changes the formants and produces phonemically unclear speech. Unnaturalness also arises at the connection points between the unit speech segments being synthesized.

An object of the present invention is therefore to solve the above problems of the prior art and to provide a speech synthesis method that preserves the phonemic quality of the synthesized speech and keeps the naturalness of a human voice, even when pitch and formants are finely controlled to improve the quality of the synthesized speech, for example by using smaller synthesis units or by adding prosodic information such as intonation.

Another object of the present invention is to provide a method for measuring human perceptual characteristics for speech, by making it possible to control prosodic properties such as intonation and phonemic properties based on formant frequencies and the like.

[Means for Solving the Problems] To this end, the present invention is characterized by: determining unit speech information and prosodic information based on output speech information; selecting, from pre-stored unit speech data, the unit speech data corresponding to the unit speech on the basis of the determined unit speech information; calculating or extracting a pitch period, a spectral envelope, a formant trajectory, and a unit speech waveform from the selected unit speech data; changing the calculated or extracted pitch period in order to connect the unit speech waveforms and to add the prosodic information; calculating the spectral envelope resulting from the pitch change at the changed pitch period; calculating a first spectral change from the spectral envelope resulting from the pitch change and the calculated or extracted spectral envelope; changing the calculated or extracted formant trajectory in order to connect the unit speech waveforms; calculating the spectral envelope resulting from the formant change on the basis of the changed formant trajectory; calculating a second spectral change from the spectral envelope resulting from the formant change and the calculated or extracted spectral envelope; changing, on the basis of the first and second spectral changes, the spectral envelope of the unit speech waveform whose pitch period has been changed; connecting the unit speech waveforms whose spectral envelopes have been changed; and then outputting the connected speech.

[Operation] With the above configuration, high-quality speech can be synthesized in which the intonation needed for meaningful speech is added while the phonemic quality and naturalness of each unit speech are maintained.

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings.

FIG. 1 is a block diagram of a speech synthesis system showing one embodiment of the present invention. In the figure, 2 is a unit speech selection section, 4 is a vocal cord frequency correction section, 6 is a formant correction section, and 8 is a frequency characteristic changing section. Each section is implemented in a computer, and with this configuration the speech synthesis processing is executed using memory such as ROM, RAM, or disk memory.

When output speech information such as a sentence is input to the unit speech selection section 2, it is first converted into a phoneme symbol string, which is discrete linguistic information, and prosodic information such as speech duration, accent, intonation, and pauses is determined. The sequence of unit speech segments that makes up the determined phoneme symbol string is then selected from a pre-stored set of unit speech.

Instead of determining a phoneme symbol string from the output speech information, words, characters, or the like may be determined and the unit speech sequence selected on that basis. Also, although the stored unit speech is a waveform in this example, linear prediction coefficients or pitch and formant trajectories of the unit speech may instead be stored as the unit speech data, so that the pitch period, spectral envelope, and formant trajectory described below are extracted directly rather than calculated.

Next, the vocal cord frequency correction section 4 identifies the voiced sections of the unit speech obtained by the unit speech selection section 2 and obtains the pitch frequency of each voiced section and its trajectory over time. Referring to the pitch frequency trajectories of the preceding and following unit speech, it then modifies the pitch frequency so that the trajectories connect smoothly, adds components such as accent and intonation as further modification components, and expands or contracts the duration of the waveform in each pitch period according to the new pitch frequency. This makes the pitch frequency connect smoothly and controls accent, intonation, and other prosodic information. The change in the spectral envelope caused by this processing is also obtained.

The formant correction section 6 determines the formants of each voiced section from the shape (resonance frequencies) of the spectral envelope and obtains their trajectories. Referring to the formant frequency trajectories of the preceding and following unit speech, it then modifies the formant frequencies so that the trajectories connect smoothly. It also obtains the change, relative to the spectral envelope of the original speech, caused by the formant modification.

The frequency characteristic changing section 8 changes the spectral envelope of the waveform that was expanded or contracted for the pitch frequency change, according to the spectral change obtained by the formant correction section and the spectral change obtained by the vocal cord frequency correction section.

Details of the processing in each section are described with reference to the block diagram and flowchart in FIGS. 2(A) and 2(B). FIG. 2(A) shows details of the configuration of FIG. 1. The unit speech selection section 2 consists of a unit speech storage section 22 and a speech selection section 24. The vocal cord frequency correction section 4 consists of a pitch extraction control section 42, a waveform expansion/contraction section 44, a spectral envelope extraction section 46, and a correction component extraction section 48. The formant correction section 6 consists of a spectral envelope extraction section 62, a spectral envelope control section 64, and a correction component extraction section 66. The frequency characteristic changing section 8 consists of an FFT section 82, a spectral envelope changing section 84, and an IFFT section 86. The number to the left of each step in the flowchart of FIG. 2(B) is the number of a section in FIG. 2(A) and indicates that the step is executed in that section.

In the above configuration, the unit speech storage section 22 stores, as unit speech, pre-uttered speech in sets of phoneme sequences such as vowel-consonant-vowel, A/D-converted with 12-bit quantization at a sampling frequency of 15 kHz. The speech selection section 24 determines a phoneme symbol string from the input output speech information, such as a sentence, according to predetermined phonological rules, and further determines duration, accent, intonation, pauses, and so on according to prosodic rules.

The corresponding unit speech is then selected from the unit speech storage section 22 on the basis of the determined phoneme symbol string.

Each retrieved unit speech is first divided by the pitch extraction control section 42 into sounded and silent sections according to the presence or absence of speech power. Next, the first-order correlation coefficient and the zero-crossing count are obtained for the speech in the sounded sections to distinguish unvoiced consonant sections from voiced sections. Examining the high-frequency content of the speech through both the first-order correlation coefficient and the zero-crossing count makes this discrimination reliable.
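
The two features named above can be sketched as follows. This is only an illustration: the function names, the thresholds, and the rule for combining the two features are assumptions, since the text does not give the decision rule.

```python
import math


def first_order_correlation(x):
    """Normalized autocorrelation at lag 1, r(1) / r(0)."""
    r0 = sum(v * v for v in x)
    if r0 == 0.0:
        return 0.0
    r1 = sum(a * b for a, b in zip(x, x[1:]))
    return r1 / r0


def zero_crossings(x):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(x, x[1:]) if (a >= 0.0) != (b >= 0.0))


def is_voiced(x, corr_threshold=0.5, zc_rate_threshold=0.25):
    """Hypothetical rule: voiced when the lag-1 correlation is high
    and the zero-crossing rate is low (little high-frequency energy).
    Both thresholds are assumptions, not values from the text."""
    zc_rate = zero_crossings(x) / max(len(x) - 1, 1)
    return first_order_correlation(x) > corr_threshold and zc_rate < zc_rate_threshold


# A 150 Hz tone sampled at 15 kHz behaves like a voiced sound here.
tone = [math.sin(2.0 * math.pi * 150.0 * n / 15000.0) for n in range(300)]
# A sample-by-sample alternating signal is dominated by high frequencies.
hiss = [1.0 if n % 2 == 0 else -1.0 for n in range(300)]
```

A noise-like signal gives a low or negative lag-1 correlation and many zero crossings, so checking both features guards against either one being misleading on its own.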

The durations of the silent sections and the waveforms of the unvoiced consonant sections identified here are recorded in memory as they are.

Furthermore, in the pitch extraction control section 42, linear prediction analysis using a so-called vocal tract inverse filter is applied to the speech waveform in the voiced section to obtain a residual waveform, and correlation analysis of this residual waveform gives the pitch period from the spacing of the correlation peaks. This is done over the entire voiced section of the unit speech.
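
As a rough sketch of the correlation step only, the function below picks the lag with the largest autocorrelation of a residual signal. The inverse filtering that produces the residual is not shown, and the function name and the lag search range are assumptions.

```python
def pitch_period_from_autocorrelation(residual, min_lag, max_lag):
    """Return the lag (in samples) with the largest autocorrelation.

    min_lag and max_lag bound the search; they are assumed parameters
    and would be set from the expected pitch range of the speaker.
    """
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(residual[n] * residual[n + lag]
                    for n in range(len(residual) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


# Idealized residual: one impulse every 100 samples (150 Hz at 15 kHz).
residual = [1.0 if n % 100 == 0 else 0.0 for n in range(600)]
period = pitch_period_from_autocorrelation(residual, 40, 250)
```

The residual is used rather than the speech itself because removing the vocal tract resonances leaves peaks that mainly reflect the glottal excitation.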

Next, each of the pitch periods obtained is modified so that the connection with the immediately preceding (earlier in time) unit speech sounds smooth and so that accent and intonation are correct over the phrase or sentence as a whole, and a new pitch period sequence is calculated.

That is, the average pitch period over all pitch periods in the unit speech is first obtained as a geometric mean, in consideration of human auditory perception.

Here, let $P_n$ be the n-th pitch period and $L$ the total number of pitch periods in the unit speech. The average pitch period $P_{ave}$ is

$$P_{ave} = (P_1 \times P_2 \times \cdots \times P_L)^{1/L}$$

With $P_{ave1}$ denoting the average pitch period of the immediately preceding unit speech, the average pitch period adjustment $R$ is

$$R = P_{ave1} / P_{ave}$$

Let $Q_n$ be the period change coefficient calculated from the accent rules and intonation rules. In addition, as shown in FIG. 3, the $L$ pitch periods of the sequence obtained after adjusting the average pitch period are adjusted by a coefficient $S_n$ so that the pitch period sequences connect smoothly:

$$S_n = 1 \qquad (L/2 \le n \le L)$$

The expression for $S_n$ for $n < L/2$ is not legible in the source text. In it, $P'$ denotes the last pitch period of the corrected pitch period sequence of the immediately preceding (past) unit speech.

Combining these adjustments, the total adjustment $R_n$ is

$$R_n = R \cdot Q_n \cdot S_n$$

and multiplying $P_n$ by $R_n$ for each of the $L$ pitch periods gives the new pitch period information.
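
The pitch adjustment described above can be sketched as follows. The names are hypothetical, and the smoothing coefficients are passed in as data because the text does not give their full definition.

```python
import math


def geometric_mean(periods):
    """Geometric mean of the pitch periods of one unit speech."""
    return math.exp(sum(math.log(p) for p in periods) / len(periods))


def adjust_pitch_periods(periods, prev_mean, q, s):
    """Scale each period by R * Q_n * S_n.

    prev_mean: geometric-mean period of the preceding unit speech.
    q: per-period accent/intonation coefficients (assumed given).
    s: per-period smoothing coefficients (assumed given).
    """
    r = prev_mean / geometric_mean(periods)  # average-period adjustment R
    return [p * r * qn * sn for p, qn, sn in zip(periods, q, s)]


periods = [100.0, 110.0, 121.0]  # samples per pitch period
ones = [1.0, 1.0, 1.0]
adjusted = adjust_pitch_periods(periods, 99.0, ones, ones)
```

With all coefficients set to 1, the only effect is to move the geometric-mean period of the unit to that of the preceding unit.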

Meanwhile, the spectral envelope extraction section 62 obtains the spectral envelope of the original speech. That is, as shown in FIG. 4, the point in the original unit speech waveform immediately before the waveform level rises sharply is taken as the start of a pitch period. Based on the pitch period first obtained by the pitch extraction control section 42, one pitch section is defined with its end point one sample before the start of the next pitch period, and a window of about 20 msec is applied with the center of the pitch section as the center of the analysis window. This windowing allows short-time spectral analysis from a finite number of samples. Linear prediction analysis is then performed again on the windowed data to calculate the linear prediction coefficients $\alpha_1$ to $\alpha_p$. Here $p$ is the order of the linear prediction analysis. In general, about $p = 10$ is used for female voices and about $p = 14$ for male voices.

Furthermore, the spectral envelope $H(k)$ of the original speech is obtained from the linear prediction coefficients $\alpha_1$ to $\alpha_p$ for $k = 1$ to $N$ (the equation itself is not legible in the source text). Here $N$ is a power of 2 larger than the number of samples, and is set to 512.
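
Since the envelope equation did not survive in this text, the sketch below assumes the common all-pole form, in which the envelope at bin k is the reciprocal of the magnitude of $1 + \sum_i \alpha_i e^{-j 2\pi k i / N}$. That form, the absence of a gain term, and the function name are all assumptions, not the patent's stated formula.

```python
import cmath
import math


def lpc_envelope(alpha, n_points=512):
    """Assumed all-pole envelope: 1 / |1 + sum_i alpha_i * exp(-j*2*pi*k*i/N)|
    for k = 1 .. N. No gain term is applied."""
    envelope = []
    for k in range(1, n_points + 1):
        a = 1.0 + sum(
            coeff * cmath.exp(-2j * math.pi * k * (i + 1) / n_points)
            for i, coeff in enumerate(alpha)
        )
        envelope.append(1.0 / abs(a))
    return envelope


# First-order example: 1 - 0.9 z^-1 has a single low-frequency resonance.
env = lpc_envelope([-0.9])
```

In this example the envelope is largest at k = N (zero frequency) and smallest at k = N/2, the highest frequency bin.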

This processing is repeated, shifting by one pitch section at a time, until the voiced section ends.

The waveform expansion/contraction section 44 expands or contracts the waveform of each pitch period according to the new pitch period information obtained by the pitch extraction control section 42. Let $k$ be the number of samples in one pitch period of the original unit speech waveform and $k'$ the number of samples corresponding to the changed pitch period. To shorten the pitch period, the waveform is truncated at the $k'$-th sample from the start of the pitch section. Conversely, to lengthen it, the sample values from $m = k + 1$ to $m = k'$ are obtained with the linear prediction coefficients $\alpha_1$ to $\alpha_p$ from the spectral envelope extraction section 62, using the following equation, to generate the continuation of the waveform:

$$x(m) = \alpha_1 x(m-1) + \alpha_2 x(m-2) + \cdots + \alpha_p x(m-p)$$

This processing is repeated, shifting by one pitch section at a time, until the voiced section ends. Because the speaking rate changes by the amount the pitch periods are expanded or contracted, waveforms are thinned out or repeated in units of one pitch period so that the utterance duration of the original unit speech is maintained.
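
A minimal sketch of the truncate-or-extend step for one pitch period follows. The function name is hypothetical, and the recursion uses the sign convention exactly as printed in this text, x(m) = sum of alpha_i * x(m - i).

```python
def resize_pitch_period(samples, new_length, alpha):
    """Truncate one pitch period to new_length samples, or extend it
    by linear prediction when new_length exceeds the original length."""
    out = list(samples[:new_length])
    while len(out) < new_length:
        m = len(out)
        # Predict the next sample from the previous len(alpha) samples;
        # samples before the start of the period are treated as absent.
        out.append(sum(a * out[m - 1 - i]
                       for i, a in enumerate(alpha) if m - 1 - i >= 0))
    return out


shortened = resize_pitch_period([1.0, 0.5, 0.25], 2, [0.5])
lengthened = resize_pitch_period([1.0, 0.5], 4, [0.5])
```

Extending by prediction continues the decaying resonance of the current period instead of padding with zeros, which keeps the added samples consistent with the estimated vocal tract filter.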

The duration is also corrected here, by the same means, on the basis of the prosodic information obtained from the speech selection section 24.

Because changing the pitch leaves a large discontinuity between the final sample of one pitch section and the first sample of the next, the data of several samples before and after these two points are approximated by a cubic curve fitted by the least squares method, so that the sections connect continuously.

When the processing by the pitch extraction control section 42, the waveform expansion/contraction section 44, and the spectral envelope extraction section 62 is finished, the spectral envelope extraction section 46 first applies a window of about 20 msec, as above, centered on one pitch section of the pitch-modified waveform obtained from the waveform expansion/contraction section 44. Linear prediction analysis of these samples gives the linear prediction coefficients $\alpha'_1$ to $\alpha'_p$, from which the spectral envelope after the pitch change, written here as $\hat{H}(k)$, is obtained for $k = 1$ to $N$ by equation (1) (not legible in the source text). $N$ is 512, as before.

Next, the correction component extraction section 48 calculates the change of the spectral envelope $\hat{H}(k)$, distorted by the pitch period change, relative to the spectral envelope $H(k)$ of the original speech. That is, $U(k) = H(k) / \hat{H}(k)$ is calculated for each pitch period and stored in memory.

The spectral envelope control section 64 takes the $\alpha_1$ to $\alpha_p$ obtained by the spectral envelope extraction section 62 as coefficients and finds the $p$ complex roots $z_1$ to $z_p$ of the following equation:

$$1 + \alpha_1 z^{-1} + \alpha_2 z^{-2} + \cdots + \alpha_p z^{-p} = 0$$

These $p$ roots include complex conjugate pairs, and each conjugate pair can correspond to one formant. From each root $z_i$, the resonance frequency $F_i$ and its bandwidth $B_i$ are obtained by the following equations and recorded in memory, and the processing is repeated, shifting by one pitch section at a time, until the voiced section of the unit speech ends.

$$F_i = \frac{F_s}{2\pi} \arg(z_i), \qquad B_i = \frac{F_s}{\pi} \left| \log |z_i| \right|$$

Here $F_s$ is the sampling frequency.
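
The mapping from one root to a resonance frequency and bandwidth can be sketched as below. The function name is hypothetical, and the bandwidth is taken as (Fs/π) times the absolute value of log|z|, which for roots inside the unit circle matches the usual −(Fs/π)·log|z|.

```python
import cmath
import math


def root_to_resonance(z, fs):
    """Resonance frequency and bandwidth (both in Hz) of one complex root."""
    freq = fs / (2.0 * math.pi) * cmath.phase(z)
    bandwidth = fs / math.pi * abs(math.log(abs(z)))
    return freq, bandwidth


fs = 15000.0
# Root built to sit at 500 Hz with a 100 Hz bandwidth.
z = cmath.exp(complex(-math.pi * 100.0 / fs, 2.0 * math.pi * 500.0 / fs))
freq, bandwidth = root_to_resonance(z, fs)
```

Only the root with positive angle in each conjugate pair needs to be passed in, since its partner gives the same bandwidth and the negated frequency.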

Furthermore, from the series of resonance frequencies, narrow-bandwidth resonances are selected in ascending order of frequency as the first formant, second formant, third formant, and so on, taking their bandwidth and continuity into account, and the formant frequency trajectories are obtained.

In addition, as shown in FIG. 5, to improve the connection with the immediately preceding unit speech, the spectral envelope control section 64 connects the trajectories of the formants and their bandwidths continuously. It does so by interpolation with a cubic curve fitted by the least squares method, as before, using several samples before and after the final sample of the preceding unit speech and the first sample of the unit speech being processed.

Once the new formant frequency trajectories and bandwidths have been determined in this way, new linear prediction coefficients are obtained as follows. Taking together the resonance frequencies of the modified formants, of the unmodified formants, and of the resonances not recognized as formants, let the new resonance frequencies be $F'_i$ and their bandwidths $B'_i$. New roots $z'_i$ are obtained from

$$z'_i = \exp\left(-\pi B'_i / F_s + j 2\pi F'_i / F_s\right)$$

The $z'_i$ include the complex conjugate pairs corresponding to the formants. The $p$-th order equation having these $p$ values $z'_i$ as roots is

$$(1 - z'_1 z^{-1})(1 - z'_2 z^{-1}) \cdots (1 - z'_p z^{-1}) = 0$$

Writing $\beta_k$ for the coefficient of $z^{-k}$ when this is expanded gives

$$1 + \beta_1 z^{-1} + \beta_2 z^{-2} + \cdots + \beta_p z^{-p} = 0$$

and the coefficients $\beta_1$ to $\beta_p$ are the new linear prediction coefficients. From them, the spectral envelope after the formant change, written here as $\tilde{H}(k)$, is obtained for $k = 1$ to $N$ by equation (2) (not legible in the source text), with $N = 512$.
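
Rebuilding prediction coefficients from modified resonances can be sketched as follows. The function names are hypothetical, and the example uses a single formant, whose conjugate pair of roots yields a second-order polynomial.

```python
import cmath
import math


def resonance_to_root(freq, bandwidth, fs):
    """Root exp(-pi*B/Fs + j*2*pi*F/Fs) for one resonance."""
    return cmath.exp(complex(-math.pi * bandwidth / fs,
                             2.0 * math.pi * freq / fs))


def coefficients_from_roots(roots):
    """Expand prod_i (1 - z_i * z^-1) and return [1, beta_1, ..., beta_p].

    Imaginary parts cancel when the roots come in conjugate pairs,
    so only the real parts are returned.
    """
    coeffs = [complex(1.0)]
    for r in roots:
        expanded = coeffs + [0j]
        for i, c in enumerate(coeffs):
            expanded[i + 1] -= r * c  # multiply by (1 - r * z^-1)
        coeffs = expanded
    return [c.real for c in coeffs]


fs = 15000.0
z1 = resonance_to_root(500.0, 100.0, fs)
beta = coefficients_from_roots([z1, z1.conjugate()])
```

For one conjugate pair the result is [1, −2·Re(z), |z|²], which gives a quick check on the expansion.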

The correction component extraction section 66 calculates the change of the spectrum $\tilde{H}(k)$, altered by the formant change, relative to the spectral envelope $H(k)$ of the original speech obtained by the spectral envelope extraction section 62. That is, $V(k) = H(k) / \tilde{H}(k)$ is calculated for each pitch period and recorded in memory.

In the frequency characteristic changing section 8, the FFT section 82 first multiplies the $N$ samples $x(1)$ to $x(N)$ of the pitch period modified by the pitch extraction control section 42 by the time window $w(i)$ below to give $y(1)$ to $y(N)$:

$$y(i) = w(i) \cdot x(i), \qquad 1 \le i \le N$$

where

$$w(i) = \begin{cases} 0.5\,[1 - \cos(\pi i / L)] & 1 \le i < L \\ 1 & L \le i < N - L \\ 0.5\,[1 + \cos(\pi (i - N + L) / L)] & N - L \le i \le N \end{cases}$$

An $N$-point fast Fourier transform is then applied to $y(i)$ to convert it to the frequency domain, giving $Y(k)$.
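
The time window above, flat in the middle with raised-cosine tapers of length L at both ends, can be generated as follows. The function name and the taper length used in the example are assumptions; the text does not state a value for L.

```python
import math


def tapered_window(n, taper):
    """Window w(1..n): cosine rise over the first `taper` samples,
    1.0 in the middle, cosine fall over the last `taper` samples."""
    w = []
    for i in range(1, n + 1):
        if i < taper:
            w.append(0.5 * (1.0 - math.cos(math.pi * i / taper)))
        elif i < n - taper:
            w.append(1.0)
        else:
            w.append(0.5 * (1.0 + math.cos(math.pi * (i - n + taper) / taper)))
    return w


window = tapered_window(512, 64)  # taper length 64 is an assumed value
```

The list is zero-based, so `window[i - 1]` holds w(i). The two cosine branches meet the flat part at exactly 1, so the window has no jumps.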

Next, the spectral envelope changing section 84 modifies $Y(k)$ using the change component $V(k)$ due to the formant change, calculated by the correction component extraction section 66, and the change component $U(k)$ due to the period change, calculated by the correction component extraction section 48. That is, the corrected frequency-domain representation $\hat{Y}(k)$ is obtained as

$$\hat{Y}(k) = U(k) \cdot V(k) \cdot Y(k), \qquad 1 \le k \le N$$
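
The correction is a bin-by-bin product, sketched below with hypothetical helper names. The ratio helper simply divides one envelope by another, in the order the text gives for U(k) and V(k).

```python
def change_ratio(h_reference, h_modified):
    """Per-bin ratio of a reference envelope to a modified envelope."""
    return [a / b for a, b in zip(h_reference, h_modified)]


def correct_spectrum(y_spec, u, v):
    """Multiply each complex spectrum bin by the two real correction factors."""
    return [uk * vk * yk for uk, vk, yk in zip(u, v, y_spec)]


u = change_ratio([2.0, 4.0], [1.0, 8.0])  # toy envelopes, two bins only
v = [1.0, 1.0]                            # no formant change in this toy case
corrected = correct_spectrum([1 + 1j, 2 + 0j], u, v)
```

Because the factors are real, only the magnitude of each bin is scaled and its phase is left unchanged.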

Next, the IFFT section 86 converts the corrected frequency-domain representation $\hat{Y}(k)$ into the time-domain speech waveform $\hat{y}(k)$ by inverse fast Fourier transform. To reduce the effect of distortion at the edges when waveforms are connected, a 20 msec Hamming window is applied at the center of the $N$ resulting samples and that portion is cut out.

Next, as shown in FIG. 6, $y(k)$ is shifted by 10 msec, the above series of operations (correcting the spectral envelope and cutting out) is repeated, and the result is overlapped with the previously cut-out waveform to form continuous speech.
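
The cut-and-overlap step can be sketched as a plain overlap-add. At the 15 kHz sampling rate mentioned earlier, 20 msec is 300 samples and 10 msec is 150 samples. The function names are hypothetical, and the frames here are already cut to the window length.

```python
import math


def hamming(n):
    """Standard Hamming window of n points."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def overlap_add(frames, hop):
    """Window each equal-length frame and sum the frames at hop spacing."""
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    win = hamming(frame_len)
    for f, frame in enumerate(frames):
        for i, value in enumerate(frame):
            out[f * hop + i] += win[i] * value
    return out


# Three constant frames: 20 msec (300 samples) each, shifted by 10 msec (150).
frames = [[1.0] * 300 for _ in range(3)]
joined = overlap_add(frames, 150)
```

With a hop of half the window length, overlapping Hamming windows sum to a nearly constant value, so the joined waveform has no audible amplitude ripple at the frame rate.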

When the processing of one voiced section is finished, it is connected to the unvoiced consonant section or silent section stored in memory, and processing moves on to the next voiced section. All unit speech is processed in this way, and the finally synthesized speech is D/A-converted to give the output speech.

Incidentally, much is still unknown about how humans perceive speech. Consider testing how a physical change to the speech signal waveform affects how the phonemes of a language are heard, that is, human perceptual characteristics: for example, how speech sounds when the voice pitch or prosody is changed subtly. Using a human speaker for such tests is problematic, because changes in physical condition and the like make it hard to utter exactly the same speech, and it is generally hard to have a speaker produce subtle articulatory adjustments.

For these reasons, a method for adjusting artificially generated speech has been needed. With conventional methods, however, discontinuities in the vocal cord vibration frequency and formant frequencies caused unnaturalness, which made delicate psychoacoustic experiments difficult.

In contrast, the embodiment of the present invention described above allows fine control while preserving the naturalness of the speech, so this difficulty can be overcome.

[Effects of the Invention] As is clear from the above description, the present invention prevents the loss of sound quality caused by discontinuities when pre-stored unit speech is connected, and makes it possible to synthesize speech with intonation, accent, and the like added without affecting the naturalness and speaker individuality of the original speech.

Furthermore, because the present invention can synthesize speech while arbitrarily varying prosodic properties such as intonation and accent and phonemic properties based on formant frequencies and the like, it can provide a method for measuring human perceptual characteristics for speech.

[Brief explanation of the drawing]

FIG. 1 is a block diagram of a speech synthesis system showing one embodiment of the present invention.
FIG. 2(A) is a block diagram showing details of the system shown in FIG. 1.
FIG. 2(B) is a flowchart showing the processing of the system shown in FIG. 2(A).
FIG. 3 is a diagram explaining the conversion method that maintains continuity of the vocal cord vibration frequency trajectory.
FIG. 4 is a diagram showing how the waveform of a pitch section is cut out.
FIG. 5 is a diagram showing the conversion method that maintains continuity of the formant frequencies and bandwidths.
FIG. 6 is a diagram showing how the processed waveforms are connected.

22: unit speech storage section; 24: speech selection section; 42: pitch extraction control section; 44: waveform expansion/contraction section; 46: spectral envelope extraction section; 48: correction component extraction section; 62: spectral envelope extraction section; 64: spectral envelope control section; 66: correction component extraction section; 82: FFT section; 84: spectral envelope changing section; 86: IFFT section.

Claims

[Scope of Claims] 1) A speech synthesis method characterized by: determining unit speech information and prosodic information based on output speech information; selecting, from pre-stored unit speech data, the unit speech data corresponding to the unit speech on the basis of the determined unit speech information; calculating or extracting a pitch period, a spectral envelope, a formant trajectory, and a unit speech waveform from the selected unit speech data; changing the calculated or extracted pitch period in order to connect the calculated or extracted unit speech waveforms and to add the prosodic information; calculating the spectral envelope resulting from the pitch change at the changed pitch period; calculating a first spectral change from the spectral envelope resulting from the pitch change and the calculated or extracted spectral envelope; changing the calculated or extracted formant trajectory in order to connect the calculated or extracted unit speech waveforms; calculating the spectral envelope resulting from the formant change on the basis of the changed formant trajectory; calculating a second spectral change from the spectral envelope resulting from the formant change and the calculated or extracted spectral envelope; changing, on the basis of the first and second spectral changes, the spectral envelope of the unit speech waveform whose pitch period has been changed; connecting the unit speech waveforms whose spectral envelopes have been changed; and then outputting the connected speech. 2) A speech synthesis device characterized by synthesizing speech using the method according to claim 1.