JP3532064B2 - Speech synthesis method and speech synthesis device

Info

Publication number: JP3532064B2
Application number: JP09721097A
Authority: JP
Inventors: 幸雄田部井
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1997-04-15
Filing date: 1997-04-15
Publication date: 2004-05-31
Anticipated expiration: 2017-04-15
Also published as: JPH10288999A

Abstract

PROBLEM TO BE SOLVED: To synthesize high-quality speech through accurate detection of driving points. SOLUTION: A speech signal is divided into analysis frames (S200) and low-pass filtered (S201). The largest value max near the center of each analysis frame is detected, and its time coordinate tm is set as a tentative driving point (S204). Local maxima are then detected by tracing back from the tentative driving point over a specified interval tp×a (S205), and every local maximum detected in that traced-back time range is compared individually with a threshold max×b (S208). When a local maximum exceeds the threshold, its time coordinate replaces the driving point (S209). The speech signal is then cut out centered on the driving point (S212) to create a synthesis segment in advance, and at synthesis time the segments are windowed and overlap-added, with the driving point as the center of superposition, while shifting by one pitch period at a time.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech synthesis method and a speech synthesis device that synthesize arbitrary speech by rule, and more particularly to a speech synthesis method and a speech synthesis device that obtain synthesized speech by concatenating speech waveforms.

[0002]

[Prior Art] A conventional text-to-speech conversion device, that is, a device that converts text into speech and outputs it, is generally known to consist of a text analysis unit, a parameter generation unit, and a speech synthesis unit. The text analysis unit receives a sentence in mixed kanji and kana, performs morphological analysis with reference to a word dictionary, determines the reading, accent, and intonation, and outputs phonetic symbols with prosodic marks (an intermediate language). The parameter generation unit sets the pitch frequency pattern, phoneme durations, and so on. The speech synthesis unit performs the speech synthesis processing. Linear prediction and similar methods were formerly used for this synthesis processing, but these methods degrade information. Specifically, degradation of sound quality could not be avoided, both because vocal tract information and sound source information, which are inherently interrelated, are handled separately, and because of the constraints imposed by modeling the speech production process. For this reason, techniques have come into use in recent years that do not clearly separate vocal tract information from sound source information and that use the original speech waveform as it is, exploiting its fine, subtle fluctuations without artificial modeling, to obtain high-quality synthesized speech with little degradation.

[0003] A known method that uses the speech waveform as it is appears in F. J. Charpentier and M. G. Stella, "Diphone Synthesis Using an Overlap-Add Technique for Speech Waveforms Concatenation," Proc. Int. Conf. ASSP, Tokyo, 1986, pp. 2015-2018. In this method, pitch marks (reference points) are attached to the speech waveform in advance, the waveform is cut out centered on each pitch mark, and at synthesis time the cut-out waveforms are overlapped while the pitch mark positions are shifted by the synthesis pitch period. The method is known as PSOLA (Pitch-Synchronous Overlap-Add, the "pitch-synchronous waveform superposition method").
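The cut-out and overlap step of the pitch-synchronous method described above can be sketched as follows. This is only an illustrative sketch, not code from the cited paper or from this patent: the function names are hypothetical, the Hann window is an assumed window shape, and the pitch marks are taken as already known.

```python
import math

def hann(length):
    """Symmetric Hann window (the window shape is an assumption)."""
    if length == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]

def cut_segment(signal, mark, half_width):
    """Cut a windowed segment of 2*half_width+1 samples centered on a pitch mark."""
    window = hann(2 * half_width + 1)
    segment = []
    for k in range(-half_width, half_width + 1):
        idx = mark + k
        sample = signal[idx] if 0 <= idx < len(signal) else 0.0
        segment.append(sample * window[k + half_width])
    return segment

def overlap_add(segments, synthesis_period, half_width):
    """Overlap-add segments so that their centers are synthesis_period samples apart."""
    if not segments:
        return []
    out = [0.0] * (synthesis_period * (len(segments) - 1) + 2 * half_width + 1)
    for i, seg in enumerate(segments):
        start = i * synthesis_period  # segment center lands at start + half_width
        for k, value in enumerate(seg):
            out[start + k] += value
    return out

# Hypothetical example: marks 20 samples apart at analysis, re-spaced to 25 samples,
# which lengthens the pitch period (lowers the pitch).
signal = [0.0] * 60
for m in (10, 30, 50):
    signal[m] = 1.0
segments = [cut_segment(signal, m, 5) for m in (10, 30, 50)]
output = overlap_add(segments, 25, 5)
```

In this toy input each segment contains a single unit pulse at its center, so the output simply carries the pulses at the new spacing; with real speech the neighboring windowed segments overlap and sum.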

[0004] FIG. 2, taken from the above reference, is a schematic diagram of the pitch-synchronous waveform superposition method, in which speech waveforms are overlapped while the pitch is changed. The diagram shows an example in which the pitch period at synthesis is longer (the pitch is lower) than at analysis (when the segments were created).

[0005] Because the pitch-synchronous waveform superposition method allows the pitch to be changed as needed, it has been widely used in the speech synthesis unit of text-to-speech conversion. The method requires a pitch mark at a specific position in each pitch period of the speech waveform, and the following positions have been proposed.

[0006] (1) The peak of the speech waveform is used as the pitch mark position, for example in the "Speech pitch conversion method" described in Japanese Patent Laid-Open No. 4-372999.

[0007] In this case, energy is concentrated at the local peak positions of the speech waveform, so these positions are considered suitable for preserving the spectrum of the cut-out waveform.

[0008] (2) The peak of the short-time power is used as the pitch mark position, for example in 河井恒, 樋口宜男, 清水徹, and 山本誠一, "A Study of a Waveform-Segment-Concatenation Speech Synthesis System," IEICE Technical Report SP93-9 (1993-05), The Institute of Electronics, Information and Communication Engineers.

[0009] As in case (1), energy is concentrated at the local peak positions of the short-time power of the speech waveform, so these positions are considered suitable for preserving the spectrum of the cut-out waveform.

[0010] (3) The peak after pitch filtering is used as the pitch mark position, for example in the "Speech synthesis method and device" described in Japanese Patent Laid-Open No. 7-72897.

[0011] The peak after pitch filtering is the peak of the vocal-cord driving waveform for one pitch period, and according to that publication it represents the pitch interval well.

[0012] (4) The point delayed by 15% from the impulse driving point is used as the pitch mark position, for example in 新居康彦, 西村洋文, 吉田博子, and 蓑輪利光, "A Study of Pitch Waveform Extraction Positions," IEICE Technical Report SP95-8 (1995-05), The Institute of Electronics, Information and Communication Engineers.

[0013] According to this report, spectral distortion is minimized at that position.

[0014] (5) The glottal closure point is used as the pitch mark position, for example in 阪本正治, 斉藤隆, 鈴木和洋, 橋本泰秀, and 小林メイ, "On a Japanese Text-to-Speech Synthesis System Using the Waveform Superposition Method," IEICE Technical Report SP95-6 (1995-05), The Institute of Electronics, Information and Communication Engineers.

[0015] The glottal closure point in this report is considered equivalent to the impulse driving point (the excitation point of a one-pitch waveform). A Dynamic Wavelet transform is used to extract the glottal closure point stably.

[0016]

[Problems to Be Solved by the Invention] However, the conventional pitch mark positions described above have the following problems.

[0017] As shown in FIGS. 3 and 4, the peak of the speech waveform in (1) can be difficult to identify for some phonemes (particularly /a/ and /e/). When pitch marks are assigned automatically, the position is easily mistaken; when they are assigned manually, the subtle differences make the decision difficult.

[0018] FIG. 3 shows the speech waveform of the /e/ portion of an utterance of /he/, and FIG. 4 shows the /e/ portion of an utterance of /me/. In these figures, every pitch mark should properly be placed at a point a. However, if pitch marks are assigned by automatically extracting the peaks of the speech waveform, they are placed at points b, a, a, a, a, b in FIG. 3 and at points b, b, b, a, a, a in FIG. 4. Speech synthesized from these marks has distorted sound quality because the pitch intervals fluctuate.

[0019] When pitch marks are assigned manually to the peaks of the speech waveform, there is no clear criterion for choosing between point a and point b, so the operator cannot decide without looking further ahead in time, which is inefficient.

[0020] Using the peak after pitch filtering as the pitch mark position, as in (3), has the following problem. According to the applicant's experiments, the LPF with a 256 Hz cutoff described in Japanese Patent Laid-Open No. 7-72897 smooths the waveform considerably. The filtered peak is therefore displaced from the peak of the waveform, and this displacement causes large pitch fluctuation, producing speech with a rumbling, distorted quality. Using the peak of the speech waveform as the pitch mark position, as in (1), actually gave comparatively better results.

[0021] When the peak of the short-time power is used as the pitch mark position, as in (2), local maxima and local minima are weighted equally, so for some speakers the pitch fluctuates and the synthesized speech sounds distorted. When the point delayed by 15% from the impulse driving point is used, as in (4), locating that position requires a large amount of processing and delays the processing, and the 15% delay point is not necessarily optimal for every speaker or phoneme type. When the glottal closure point is used, as in (5), the Dynamic Wavelet transform performed to extract it requires a large amount of processing and, as in (4), delays the processing.

[0022] The present invention has been made in view of these problems and rests on the observation that, in the speech production process, the effect of the glottal driving impulse theoretically appears at the first peak of the speech waveform. Its object is to make it possible to set pitch marks with little pitch fluctuation and thereby realize a high-quality speech synthesis method and speech synthesis device.

[0023]

[Means for Solving the Problems] To solve the above problems, the speech synthesis method according to the first invention creates speech synthesis segments in advance through a step of detecting the first local maximum of a one-pitch waveform of a speech signal and a step of cutting out the speech signal centered on that first local maximum, and then performs windowed overlap-add, with the local maximum in each synthesis segment as the center of superposition, while shifting by the pitch period.

[0024] Because the first local maximum of the one-pitch waveform is used as the center point of the cut-out speech signal, the center point for superposition can be set easily with simple processing. This reduces pitch fluctuation and spectral distortion. As a result, the rumbling quality heard in the synthesized sound was reduced.

[0025] The speech synthesis method according to the second invention is characterized in that a low-pass-filtered waveform of the speech signal is used to detect the first local maximum.

[0026] Using a low-pass-filtered waveform removes fine high-frequency fluctuations and yields a smooth waveform, which makes the local maximum easier to detect. The center point for superposition can therefore be set easily.

[0027] The speech synthesis method according to the third invention creates speech synthesis segments in advance through: a step of dividing a speech signal into analysis frames; a step of low-pass filtering the speech signal; a step of detecting the largest value near the center of the analysis frame and setting its time coordinate as a tentative driving point; a step of detecting local maxima in a time range traced back from the time coordinate of the tentative driving point; a step of individually comparing every local maximum detected in that traced-back time range with a threshold; and a step of, when a local maximum exceeds the threshold, replacing the driving point with the time coordinate of that local maximum and cutting out the speech signal centered on the driving point. The method then performs windowed overlap-add, with the driving point in each synthesis segment as the center of superposition, while shifting by the pitch period.
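The driving-point search in the third invention can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: all function and parameter names are hypothetical; the factors a and b stand for the trace-back interval tp×a and the threshold max×b named in the abstract; the half-width of the "near the center" search zone is an assumed parameter; and keeping the earliest qualifying local maximum is an assumption motivated by the stated aim of finding the first peak.

```python
def local_maxima(y, start, stop):
    """Indices i in [start, stop) where y rises into i and does not rise after it."""
    return [i for i in range(max(start, 1), min(stop, len(y) - 1))
            if y[i - 1] < y[i] >= y[i + 1]]

def detect_driving_point(y, tp, a, b, center_halfwidth):
    """Return the driving-point index for one low-pass-filtered analysis frame y.

    tp: pitch period in samples; a: trace-back factor; b: threshold factor.
    center_halfwidth: half-width in samples of the assumed "near the center" zone.
    """
    center = len(y) // 2
    lo = max(0, center - center_halfwidth)
    hi = min(len(y), center + center_halfwidth + 1)
    # Tentative driving point: time coordinate of the largest value near the center.
    tm = max(range(lo, hi), key=lambda i: y[i])
    threshold = y[tm] * b
    # Trace back over tp * a samples before the tentative driving point.
    start = max(0, tm - int(round(tp * a)))
    driving = tm
    # Walk backwards so the earliest local maximum above the threshold is kept
    # (an assumption; the text only says the driving point is replaced).
    for i in reversed(local_maxima(y, start, tm)):
        if y[i] > threshold:
            driving = i
    return driving

# Hypothetical frame: the largest peak is at 32, with a strong earlier peak at 28.
frame = [0.0] * 64
frame[32] = 1.0
frame[28] = 0.8
point = detect_driving_point(frame, tp=20, a=0.5, b=0.5, center_halfwidth=4)
```

With these made-up numbers the earlier peak at index 28 exceeds the threshold 0.5 and lies within the 10-sample trace-back range, so it replaces the tentative point at index 32.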

[0028] Because the largest value near the center of the frame is set as a tentative driving point, local maxima are detected in the time range traced back from that point, and the driving point that finally serves as the cut-out center is determined by comparison with the tentative driving point, the first peak produced by the glottal driving impulse in the speech production process can be detected and pitch marks can be set accurately.

[0029] In addition, the center point for superposition can be set easily with simple processing, and pitch fluctuation and spectral distortion are reduced. As a result, the rumbling quality heard in the synthesized sound was reduced.

[0030] Furthermore, handling speech segments in units of fixed length and processing them frame by frame makes the speech waveform data easier to control during synthesis.

[0031] The speech synthesis method according to the fourth invention is characterized in that the pitch period within the analysis frame is detected, and local maxima are detected by tracing back from the tentative driving point within a time range equal to a predetermined multiple of that pitch period.

[0032] This configuration limits the region searched for local maxima to a fixed range, which enables pitch marks to be set accurately.

[0033] The speech synthesis method according to the fifth invention is characterized in that the threshold is set to a predetermined multiple of the largest value detected as the tentative driving point near the center of the analysis frame.

[0034] Because the threshold is set to a predetermined multiple of the largest value detected as the tentative driving point, small spurious local maxima are not detected by mistake. The method therefore operates stably even when noise is superimposed, without being affected by spurious local maxima.

[0035] The speech synthesis method according to the sixth invention is characterized in that a predetermined value is set according to the phoneme type of the speech signal, and local maxima are detected in the time range traced back from the tentative driving point by that predetermined value.

[0036] This configuration limits the region searched for local maxima to a fixed range, which enables pitch marks to be set accurately.

[0037] The speech synthesis device according to the seventh invention comprises: local maximum detection means for detecting the first local maximum of a one-pitch waveform of a speech signal; speech signal cut-out means for cutting out the speech signal centered on that first local maximum; speech synthesis segment storage means for storing the speech synthesis segments cut out by the speech signal cut-out means; and a speech synthesis unit that performs windowed overlap-add, with the local maximum in each stored synthesis segment as the center of superposition, while shifting by the pitch period.

[0038] Because the first local maximum of the one-pitch waveform is used as the center point of the cut-out speech signal, the center point for superposition can be set easily with simple processing, as in the method of the first invention. This reduces pitch fluctuation and spectral distortion. As a result, the rumbling quality heard in the synthesized sound was reduced.

[0039] The speech synthesis device according to the eighth invention is characterized in that a low-pass-filtered waveform of the speech signal is used to detect the first local maximum.

[0040] Because this configuration uses a low-pass-filtered waveform of the speech signal, the local maximum is easier to detect and the center point for superposition can be set easily, as in the method of the second invention.

[0041] The speech synthesis device according to the ninth invention comprises: dividing means for dividing a speech signal into analysis frames; low-pass filtering means for low-pass filtering the speech signal; largest-value detection means for detecting the largest value near the center of the analysis frame and setting its time coordinate as a tentative driving point; local maximum detection means for detecting local maxima in a time range traced back from the time coordinate of the tentative driving point; comparison means for individually comparing every local maximum detected by the local maximum detection means with a threshold; speech signal cut-out means for, when the comparison shows that a local maximum exceeds the threshold, replacing the driving point with the time coordinate of that local maximum and cutting out the speech signal centered on the driving point; speech synthesis segment storage means for storing the speech synthesis segments cut out by the speech signal cut-out means; and a speech synthesis unit that performs windowed overlap-add, with the driving point in each stored synthesis segment as the center of superposition, while shifting by the pitch period.

[0042] Because the largest value near the center of the frame is set as a tentative driving point, local maxima are detected in the time range traced back from that point, and the driving point that finally serves as the cut-out center is determined by comparison with the tentative driving point, the first peak produced by the glottal driving impulse in the speech production process can be detected and pitch marks can be set accurately, as in the method of the third invention.

[0043] Spectral distortion is also reduced, and the rumbling quality heard in the synthesized sound decreases. Furthermore, the speech waveform data becomes easier to control during synthesis.

[0044] The speech synthesis device according to the tenth invention is characterized in that the pitch period within the analysis frame is detected, and local maxima are detected by tracing back from the tentative driving point within a time range equal to a predetermined multiple of that pitch period.

[0045] This configuration limits the region searched for local maxima to a fixed range, which enables pitch marks to be set accurately, as in the method of the fourth invention.

[0046] The speech synthesis device according to the eleventh invention is characterized in that the threshold is set to a predetermined multiple of the largest value detected as the tentative driving point near the center of the analysis frame.

[0047] Because the threshold is set to a predetermined multiple of the largest value detected as the tentative driving point, small spurious local maxima are not detected by mistake, as in the method of the fifth invention. The device therefore operates stably even when noise is superimposed, without being affected by spurious local maxima.

[0048] The speech synthesis device according to the twelfth invention is characterized in that a predetermined value is set according to the phoneme type of the speech signal, and local maxima are detected in the time range traced back from the tentative driving point by that predetermined value.

[0049] In this case as well, as in the eleventh invention, pitch marks can be set accurately and the device operates stably.

[0050]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the accompanying drawings.

[0051] [First Embodiment] The speech synthesis method and speech synthesis device according to the first embodiment are described below. FIG. 5 is a block diagram showing the configuration of the speech synthesis device according to the first embodiment.

[0052] When a sentence in mixed kanji and kana is input, the text analysis unit 101 in the figure performs morphological analysis with reference to the word dictionary 102, determines the reading, accent, and intonation of the sentence, and outputs phonetic symbols with prosodic marks (an intermediate language). The parameter generation unit 103 sets the pitch frequency pattern, phoneme durations, and so on. The text analysis unit 101, the word dictionary 102, and the parameter generation unit 103 do not differ from conventional ones.

[0053] The speech synthesis unit 104 performs the speech synthesis processing. Specifically, a segment is selected from the segment dictionary 105; the windowing unit 106 applies a time window of length Tpl (described below) to the segment so that the driving point (described below) is at the center of superposition; and the speech synthesis unit 104 synthesizes speech by the pitch-synchronous waveform superposition method.

[0054] Here, the time window length Tpl is set as Tpl = C0 × min(Tpa, Tps), where Tpa is the pitch period at analysis and Tps is the pitch period at synthesis. C0 is a value of about 2.0.
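The window-length rule can be written out as a small worked example. The function name and the numeric pitch periods are hypothetical; only the formula Tpl = C0 × min(Tpa, Tps) and the value C0 of about 2.0 come from the text.

```python
def time_window_length(tpa, tps, c0=2.0):
    """Tpl = C0 * min(Tpa, Tps), in the same unit as the inputs."""
    return c0 * min(tpa, tps)

# Hypothetical values: 8 ms pitch period at analysis, 10 ms at synthesis.
tpl = time_window_length(8.0, 10.0)
```

With these assumed values the window is 16 ms long. Because the shorter of the two periods is used, the window stays at about two periods of whichever pitch is higher, whether the pitch is being lowered or raised.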

[0055] The segment dictionary 105 is a dictionary in which segments are stored. The segments are created by the segment creation unit 107. The segment creation unit 107 is the principal part of the present invention and has the processing functions shown in the flowchart of FIG. 1.

[0056] The processing in the segment creation unit 107 is described with reference to FIG. 1. When a speech signal is input to the segment creation unit 107 from the speech signal input unit 108, which includes a data disk or the like, step S200 first divides the input speech signal data into analysis frames. An analysis frame is a section of speech signal data of fixed length; in this embodiment each frame is 32 ms long and successive frames are shifted by 8 ms. Let N be the total number of frames, and let the waveform data be Xn(l), where n = 1, 2, ..., N (frames) and l = 1, 2, ..., Fr (points). Fr is the number of samples in one frame, determined by the frame length (32 ms) and the sampling frequency FS. In this embodiment, Fr = 32 × FS / 1000.
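The framing in step S200 can be sketched as follows. The function name is hypothetical, the 16 kHz sampling frequency is an assumed example value (the text does not state FS), and dropping a trailing partial frame is an assumption.

```python
def split_into_frames(samples, fs, frame_ms=32, shift_ms=8):
    """Split samples into frames of frame_ms, advancing by shift_ms each time."""
    frame_len = frame_ms * fs // 1000  # Fr = 32 * FS / 1000
    shift = shift_ms * fs // 1000
    frames = []
    start = 0
    # A trailing partial frame is dropped (assumption).
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

# Hypothetical input: one second of samples at an assumed FS of 16000 Hz.
frames = split_into_frames(list(range(16000)), fs=16000)
```

At the assumed 16 kHz rate each frame holds Fr = 512 samples and consecutive frames start 128 samples apart, so neighboring frames share three quarters of their samples.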

【００５７】ステップＳ２０１では、波形データＸn(l)
に対してＬＰＦ（ローパスフィルタ）を掛ける。これ
は、以下に述べるピーク融合処理において、細かい変動
を除去するためである。また、ＬＰＦ処理後の波形デー
タに対してＬＰＦによる遅延を補正する処理を行い、補
正後の波形データをｙn(l)とする。なお、ｎ＝１,２,
…,Ｎ（フレーム）、 l＝１,２,…,Ｆｒ（ポイント）
である。ここで用いるＬＰＦは、後で行う波形に基づく
処理のために、直線位相を持つ公知のＦＩＲ型のディジ
タルフィルタが望ましい。また、このＬＰＦ処理は、予
め一括して行われる。In step S201, an LPF (low-pass filter) is applied to the waveform data Xn(l). This removes fine fluctuations ahead of the peak fusion processing described below. The LPF output is then corrected for the delay introduced by the LPF, and the corrected waveform data is written yn(l), where n = 1, 2, ..., N (frames) and l = 1, 2, ..., Fr (points). Because later processing operates on the waveform itself, the LPF is preferably a well-known FIR digital filter with linear phase. This LPF processing is performed in advance on all of the data at once.
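A linear-phase FIR low-pass filter with delay correction, as in step S201, might look like the sketch below. The source does not give the filter design, so the Hamming-windowed sinc design, the tap count, the cutoff frequency, and the function names are all assumptions chosen for illustration. A symmetric filter of odd length N delays the signal by (N - 1) / 2 samples, and the second function shifts the output back by that amount.

```python
import math


def design_lowpass_fir(num_taps, cutoff_hz, fs):
    """Hamming-windowed sinc low-pass FIR; symmetric taps give linear phase."""
    assert num_taps % 2 == 1, "odd length keeps the group delay an integer"
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs                      # normalized cutoff (cycles/sample)
    taps = []
    for i in range(num_taps):
        k = i - mid
        if k == 0:
            ideal = 2 * fc
        else:
            ideal = math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]          # normalize to unity gain at DC


def lowpass_delay_compensated(x, taps):
    """Filter x and shift the output back by the group delay (N - 1) / 2."""
    delay = (len(taps) - 1) // 2
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, h in enumerate(taps):
            idx = n + delay - i              # causal output index n + delay
            if 0 <= idx < len(x):            # samples outside x count as zero
                acc += h * x[idx]
        y.append(acc)
    return y
```

After this correction, a peak in the filtered waveform lines up with the same time coordinate in the unfiltered waveform, which is what lets a pitch mark found on yn(l) be used to cut Xn(l).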

【００５８】ステップＳ２０２では、処理を行う分析フ
レームのフレーム番号ｎを初期化する。ステップＳ２０
３では、第ｎフレームにおける音声信号のピッチ周期ｔ
ｐを検出する。このピッチ周期ｔｐを検出する方法に
は、簡易な手法として波形のピーク間隔を検出する方法
等が考えられるが、本実施形態ではケプストラム法を用
いている。これは、より精密にピッチ周期を算出するた
めである。このケプストラム法では、図６に示す処理工
程でピッチ周期ｔｐを検出する。まず、ステップＳ４０
１で時間波形を入力し、ステップＳ４０２で窓掛けを行
う。次いで、窓掛けを行った時間波形に対してステップ
Ｓ４０３で離散フーリエ変換（ＤＦＴ）を施し、ステッ
プＳ４０４でその実部と虚部の二乗和の平方根を対数変
換する。その後、ステップＳ４０５で逆フーリエ変換
（ＩＤＦＴ）を施し、ステップＳ４０６でケプストラム
成分を得て出力する。このように、ケプストラム法は、
畳み込み演算を加法的な演算に変換するものである。音
声の有声音信号は音源成分を声道情報で畳み込んだもの
であるため、ケプストラム法は両者の分離に適してい
る。入力信号が音声の有声音信号の場合、ピッチ周期を
T0とすれば、音源成分は高ケフレンシイ（長時間領域）
のT0の近傍として現れ、声道成分は低ケフレンシイ（短
時間領域）の成分として現れる。ケプストラムからピッ
チ周期を求めるには、高ケフレンシイ部のピークを求め
て、時間原点からこのピークまでの時間を測定すればよ
い。In step S202, the frame number n of the analysis frame to be processed is initialized. In step S203, the pitch period tp of the speech signal in the n-th frame is detected. A simple way to detect the pitch period tp would be to measure the interval between waveform peaks, but this embodiment uses the cepstrum method in order to calculate the pitch period more precisely. The cepstrum method detects the pitch period tp through the processing steps shown in FIG. 6. First, the time waveform is input in step S401 and windowed in step S402. Next, the windowed time waveform is subjected to a discrete Fourier transform (DFT) in step S403, and in step S404 the square root of the sum of squares of its real and imaginary parts (the magnitude) is converted to a logarithm. An inverse discrete Fourier transform (IDFT) is then applied in step S405, and the cepstrum components are obtained and output in step S406. The cepstrum method thus converts a convolution into an addition. Because a voiced speech signal is the sound source component convolved with the vocal tract information, the cepstrum method is well suited to separating the two. When the input is a voiced speech signal with pitch period T0, the sound source component appears near T0 in the high-quefrency (long-time) region, and the vocal tract component appears in the low-quefrency (short-time) region. The pitch period is obtained from the cepstrum by finding the peak in the high-quefrency region and measuring the time from the time origin to that peak.
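The cepstrum steps S401 to S406 can be illustrated with a plain-Python sketch. The Hamming window, the 60 to 400 Hz search range, and the function name are assumptions not stated in the source; the direct DFT is used only to keep the example self-contained, and a practical implementation would use an FFT.

```python
import cmath
import math


def cepstrum_pitch_period(frame, fs, min_hz=60.0, max_hz=400.0):
    """Estimate the pitch period (in samples) of one frame via the real cepstrum."""
    n = len(frame)
    # S402: windowing (the Hamming window is an assumption)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    # S403 + S404: DFT, then logarithm of the magnitude
    log_mag = []
    for k in range(n):
        c = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        log_mag.append(math.log(abs(c) + 1e-12))  # small offset avoids log(0)
    # S405 + S406: inverse DFT, evaluated only over the assumed search range.
    # log_mag is real and symmetric, so the inverse DFT reduces to a cosine sum.
    lo = int(fs / max_hz)
    hi = min(int(fs / min_hz), n // 2)
    best_q, best_val = lo, float("-inf")
    for q in range(lo, hi + 1):
        val = sum(log_mag[k] * math.cos(2 * math.pi * k * q / n)
                  for k in range(n)) / n
        if val > best_val:
            best_q, best_val = q, val
    return best_q  # quefrency of the peak, i.e. the pitch period in samples
```

Dividing the returned value by the sampling frequency gives the pitch period in seconds.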

【００５９】次いで、図１中のステップＳ２０４で、各
分析フレームのフレーム中央近傍の最大値（ｍａｘ）
と、その時間座標ｔｍを検出する。なお、フレーム中央
近傍の決定には、前記ステップＳ２０３で求めたピッチ
周期ｔｐを用いる。フレーム中央の時間座標から前後
に、ピッチ周期ｔｐの所定数倍をフレーム中央近傍とす
る。本実施形態では、フレーム中央の時間座標から前後
に、０．６ｔｐ分を中央近傍とする。これにより、フレ
ーム中央近傍の最大値ｍａｘは、ｍａｘ＝ｍａｘｍｕｍ｛ｙn(l)；l＝（Ｆｒ／２）−
０．６ｔｐ,…,（Ｆｒ／２）＋０．６ｔｐ｝＝ｙn(tm) である。この最大値ｍａｘの時間座標ｔｍを仮の駆動点
として設定する。Next, in step S204 in FIG. 1, the maximum value (max) near the center of each analysis frame and its time coordinate tm are detected. The pitch period tp obtained in step S203 is used to define the vicinity of the frame center: it extends a predetermined multiple of the pitch period tp before and after the time coordinate of the frame center. In this embodiment, the vicinity extends 0.6tp on each side of the frame center. The maximum value max near the frame center is therefore max = maximum{yn(l); l = (Fr/2) - 0.6tp, ..., (Fr/2) + 0.6tp} = yn(tm). The time coordinate tm of this maximum value max is set as the temporary driving point.
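The search of step S204 can be sketched as below. The function name and the clamping of the search range to the frame boundaries are assumptions; indices are 0-based here, whereas the text counts l from 1.

```python
def temporary_driving_point(y, tp, ratio=0.6):
    """Return (max_value, tm) within +/- ratio * tp samples of the frame center."""
    center = len(y) // 2                       # Fr / 2
    half = int(round(ratio * tp))              # 0.6 * tp in samples
    lo = max(0, center - half)                 # clamp to the frame (assumption)
    hi = min(len(y) - 1, center + half)
    tm = max(range(lo, hi + 1), key=lambda l: y[l])
    return y[tm], tm
```

Restricting the search to roughly one pitch period around the center means a larger peak elsewhere in the frame is ignored.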

【００６０】ステップＳ２０５では、仮の駆動点として
設定された時間座標ｔｍから所定値分だけ遡った時間座
標域で全ての極大値を検出する。遡る所定値分の区間は
［ｔｍ−ｔｐ×ａ，ｔｍ）とする。ここで、定数ａは０
から１までの間の値である。この定数ａは、実験による
データを基に最も良好な数値を統計的に算出して設定す
る。なお、本実施形態では、０．２５〜０．４０程度に
設定される。In step S205, all local maxima are detected in the time coordinate range extending back a predetermined amount from the time coordinate tm set as the temporary driving point. The range searched is [tm - tp × a, tm). Here, the constant a is a value between 0 and 1, set by statistically determining the best value from experimental data. In this embodiment, it is set to about 0.25 to 0.40.

【００６１】ここでは、極大値をＰｋ、検出された極大
値Ｐｋの総数をＭ、極大値Ｐｋの時間座標をｔｐｋ（ｋ
＝１,２,…,Ｍ）として、Ｐｋ＝ｙn（ｔｐｋ）とする。Here, the local maxima are denoted Pk, the total number of detected local maxima is M, and the time coordinate of Pk is tpk (k = 1, 2, ..., M), so that Pk = yn(tpk).
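The local maximum detection of step S205 over the half-open range [tm - tp × a, tm) can be sketched as follows. The source does not define how a local maximum is tested, so the three-point comparison used here, the function name, and the default a = 0.3 (a value inside the stated 0.25 to 0.40 range) are assumptions.

```python
def local_maxima_before(y, tm, tp, a=0.3):
    """List (t, y[t]) for local maxima of y in [tm - tp * a, tm), in time order."""
    start = max(1, tm - int(round(tp * a)))    # keep t - 1 inside the frame
    peaks = []
    for t in range(start, tm):                 # tm itself is excluded
        # Assumed test: strictly above the previous sample, not below the next.
        if y[t - 1] < y[t] >= y[t + 1]:
            peaks.append((t, y[t]))
    return peaks
```

The returned list corresponds to the pairs (tpk, Pk) for k = 1, ..., M, with M equal to the length of the list.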

【００６２】次に、ステップＳ２０６〜２１１におい
て、ピーク融合処理を行う。Next, in steps S206 to 211, peak fusion processing is performed.

【００６３】まず、ステップＳ２０６では、ピッチマー
クＫの初期値として、仮の駆動点である最大値の時間座
標ｔｍを設定する。ステップＳ２０７では、最大値カウ
ンタｊをＭにセットする。First, in step S206, the time coordinate tm of the maximum value which is a temporary driving point is set as the initial value of the pitch mark K. In step S207, the maximum value counter j is set to M.

【００６４】ステップＳ２０８では、しきい値と極大値
Ｐｊとの大小を判定する。しきい値は、フレーム中央近
傍の最大値ｍａｘの所定数倍に設定する。即ち、ｍａｘ
×ｂに設定する。ここで、定数ｂは０から１までの間の
値である。この定数ｂは、前記定数ａと同様に、実験に
よるデータを基に最も良好な数値を統計的に算出して設
定する。なお、本実施形態では、０．６〜０．８程度に
設定される。In step S208, the local maximum Pj is compared with a threshold. The threshold is set to a predetermined multiple of the maximum value max near the frame center, namely max × b. Here, the constant b is a value between 0 and 1 and, like the constant a, is set by statistically determining the best value from experimental data. In this embodiment, it is set to about 0.6 to 0.8.

【００６５】このステップＳ２０８での判定の結果、極
大値Ｐｊの方がしきい値ｍａｘ×ｂよりも大きいときに
は、ステップＳ２０９に移って、ピッチマークＫの時間
座標ｔｍを前記極大値Ｐｊの時間座標ｔｐｋで置き換え
る。ステップＳ２０８の判定で、極大値Ｐｊの方がｍａ
ｘ×ｂよりも小さいときには、ステップＳ２０９での処
理は行わずにステップＳ２１０にジャンプする。そし
て、ステップＳ２１０で、最大値カウンタｊが１だけ減
算され、ステップＳ２１１で、減算された後の最大値カ
ウンタｊの値が０でないかを判定する。０でない場合は
ステップＳ２０８に戻って、このステップＳ２０８〜ス
テップＳ２１１の処理を繰り返す。これにより、仮に設
定した駆動点から最終的に真の駆動点としてのピッチマ
ークＫを求める。If the determination in step S208 finds that the local maximum Pj is larger than the threshold max × b, the process proceeds to step S209, where the time coordinate tm held in the pitch mark K is replaced with the time coordinate tpk (k = j) of the local maximum Pj. If step S208 finds that Pj is smaller than max × b, the process skips step S209 and jumps to step S210. In step S210, the maximum value counter j is decremented by 1, and step S211 determines whether the decremented counter j is nonzero. If it is nonzero, the process returns to step S208 and repeats steps S208 to S211. In this way, the pitch mark K, the true driving point, is finally obtained from the temporarily set driving point.
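The peak fusion loop of steps S206 to S211 can be sketched as below. The function name and the default b = 0.7 (inside the stated 0.6 to 0.8 range) are assumptions, and the loop is written so that an empty peak list (M = 0) is also handled, a case the text does not discuss. The peaks are assumed to be ordered by time, so counting j down from M leaves K at the earliest local maximum that exceeds the threshold.

```python
def fuse_peaks(tm, max_value, peaks, b=0.7):
    """Return the pitch mark K given the temporary driving point and the peaks.

    peaks: list of (time, value) pairs for P1..PM in time order.
    """
    k_mark = tm                      # S206: K starts at the temporary point
    j = len(peaks)                   # S207: j = M
    threshold = max_value * b        # threshold max * b used in S208
    while j > 0:                     # S211: stop when j reaches 0
        t_j, p_j = peaks[j - 1]
        if p_j > threshold:          # S208: is Pj above the threshold?
            k_mark = t_j             # S209: move K back to this peak
        j -= 1                       # S210
    return k_mark
```

For example, with tm = 140, max = 2.0 and peaks at 100 (0.5), 115 (1.6) and 130 (1.5), the threshold is 1.4 and K ends at 115.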

【００６６】図７に、前記ステップＳ２０６〜２１１の
ピーク融合処理の模式図を示す。FIG. 7 shows a schematic diagram of the peak fusion process in steps S206 to 211.

【００６７】図７は／ｒａ／と発音したうちの（／ａ
／）の部分の音声波形であって、図７（Ａ）はＬＰＦ処
理前の音声波形を、図７（Ｂ）はＬＰＦ処理後の低域濾
波波形であって遅延を補正した音声波形を示す。図７
（Ａ）（Ｂ）の波形ともに１ピッチ区間に主要な極大値
が４個ある。図７（Ｂ）の低域濾波波形は、図７（Ａ）
の音声波形と比較して、高周波の細かい変動が除去され
ているため、ピッチマーク付与には好適である。ところ
が、単にＬＰＦ処理後の低域濾波波形からその最大値抽
出を行うと、全てｂ点となってしまう。音声生成過程で
は、声門の駆動インパルスの影響が論理的に最初のピー
クに現れるためと考えられる。このため、単にＬＰＦ処
理後の低域濾波波形に対してピッチマークを付与する
（ｂ点をピッチマークとする）のは不的確である。FIG. 7 shows the speech waveform of the /a/ portion of the utterance /ra/. FIG. 7(A) shows the speech waveform before LPF processing, and FIG. 7(B) shows the delay-corrected low-pass filtered waveform after LPF processing. Both waveforms have four main local maxima in one pitch interval. Compared with the speech waveform of FIG. 7(A), the low-pass filtered waveform of FIG. 7(B) has the fine high-frequency fluctuations removed and is therefore well suited to pitch mark assignment. However, simply extracting the maximum value from the low-pass filtered waveform always yields point b. In the speech production process, the effect of the glottal driving impulse is considered, in theory, to appear at the first peak. Simply assigning the pitch mark on the low-pass filtered waveform (taking point b as the pitch mark) is therefore inaccurate.

【００６８】これに対して本実施形態のピーク融合処理
を行うと、ステップＳ２０４で求まった図７（Ｂ）中の
最大値ｂ点に対して、ステップＳ２０５で最大値ｂ点よ
り遡った位置に十分に大きい極大値ａ点が存在し、ステ
ップＳ２０８及びステップＳ２０９によってピッチマー
クＫがａ点に置き換えられる。In contrast, when the peak fusion processing of this embodiment is performed, step S205 finds a sufficiently large local maximum, point a, at a position earlier than the maximum point b of FIG. 7(B) obtained in step S204, and steps S208 and S209 replace the pitch mark K with point a.

【００６９】図１中のステップＳ２１２では、ピーク融
合処理により求まったピッチマークＫの前後にそれぞれ
Ｌ分の音声データXnを切り出し、Ｋが中央に位置するよ
うにセンタリングする。なお、ここではＬ分を１２ｍ秒
に設定した。これは、本発明者の予備実験により、女性
よりもピッチ周期の大きい男性で最長のピッチ周期に余
裕を持たせた値である。In step S212 in FIG. 1, speech data Xn of length L is cut out on each side of the pitch mark K obtained by the peak fusion processing, and is centered so that K lies in the middle. Here, L was set to 12 ms. According to the inventor's preliminary experiments, this value leaves a margin over the longest pitch period of male speakers, whose pitch periods are longer than those of female speakers.
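The cut-out of step S212 can be sketched as follows. The function name is hypothetical, and zero-padding where the 12 ms span runs past the available data is an assumption, since the source does not say how that case is handled.

```python
def cut_segment(x, k_mark, fs, half_ms=12):
    """Cut L = half_ms of samples on each side of the pitch mark K.

    The result has 2 * half + 1 samples with K at index `half`.
    """
    half = half_ms * fs // 1000
    segment = []
    for t in range(k_mark - half, k_mark + half + 1):
        # Assumption: positions outside the data are filled with zeros.
        segment.append(x[t] if 0 <= t < len(x) else 0.0)
    return segment
```

Because every segment has the same length and keeps K at its center, later stages can align segments by their midpoints without storing K separately.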

【００７０】ステップＳ２１３では、第ｎフレームにお
ける素片として、ステップＳ２１２で切り出した音声デ
ータをデータディスク等の記憶媒体に、素片辞書１０５
として順次書き込みを行う。ステップＳ２１４では、全
分析フレームについて素片の書き込みが終了したか否か
の判定を行う。この書き込みが終了していなければ、ス
テップＳ２１５でフレーム番号を更新してステップＳ２
０３に戻り、ステップＳ２０３からステップＳ２１３ま
での処理を継続する。ステップＳ２１４で全分析フレー
ムの処理が終了したと判定した場合は、素片辞書１０５
のデータディスクのクローズ処理等（図示せず）を行っ
て素片作成部１０７の動作を終了する。In step S213, the speech data cut out in step S212 is written sequentially, as the segment for the n-th frame, to the segment dictionary 105 on a storage medium such as a data disk. Step S214 determines whether segments have been written for all analysis frames. If not, the frame number is updated in step S215, the process returns to step S203, and the processing from step S203 to step S213 continues. If step S214 determines that all analysis frames have been processed, closing processing for the data disk of the segment dictionary 105 and the like (not shown) is performed, and the operation of the segment creating unit 107 ends.

【００７１】以上の処理によって作成された素片が書き
込まれた素片辞書１０５内から、対象となる素片が適宜
選択され、窓掛け部１０６にて窓掛けが行われて、音声
合成部１０４で音声合成処理が行われる。The target segments are selected as appropriate from the segment dictionary 105, into which the segments created by the above processing have been written, windowed by the windowing unit 106, and passed to the speech synthesis unit 104 for speech synthesis.
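Windowing followed by superposition at pitch-period intervals can be sketched as below. This passage does not specify the window shape or how the synthesis unit is structured, so the Hann window, the fixed pitch period, the requirement that all segments share one odd length, and the function name are assumptions made only for illustration.

```python
import math


def overlap_add(segments, pitch_period):
    """Window each segment and add it with its center shifted by pitch_period."""
    n = len(segments[0])                       # assumed common odd length
    half = (n - 1) // 2                        # index of the driving point
    out = [0.0] * (pitch_period * (len(segments) - 1) + n)
    for i, seg in enumerate(segments):
        center = half + i * pitch_period       # where this driving point lands
        for t, s in enumerate(seg):
            w = 0.5 - 0.5 * math.cos(2 * math.pi * t / (n - 1))  # Hann window
            out[center - half + t] += w * s
    return out
```

Choosing a pitch_period different from the original pitch period changes the pitch of the output, because the driving points are then spaced differently from how they were spaced in the recorded speech.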

【００７２】［効果］（１）以上のように、音声信号の１ピッチ波形の最初
の極大点を、切り出す音声信号の中心点にしているの
で、重畳する際の中心点を簡易な処理によって容易に設
定することができる。これにより、ピッチの揺れが少な
くスペクトル歪みも小さくすることができる。この結
果、聴感上ゴロゴロした音を減少させることができる。[Effects] (1) As described above, the first local maximum point of the one-pitch waveform of the speech signal is used as the center point of the speech signal to be cut out, so the center point for superposition can be set easily by simple processing. This reduces pitch fluctuation and also reduces spectral distortion. As a result, audible rumbling sounds can be reduced.

【００７３】（２）仮の駆動点を定め、その点から遡
った時間領域で、この駆動点との比較において実際の処
理を行う駆動点を特定するので、声門の駆動インパルス
の影響としての最初の極大点を検出することができ、正
確なピッチマークの設定が可能になる。これにより、ス
ペクトル歪みが小さく、安定して動作すると共に、処理
量の減少等を図ることができる。(2) A temporary driving point is set, and the driving point actually used for processing is identified by comparison with this temporary driving point in the time region traced back from it. The first local maximum point, which reflects the glottal driving impulse, can therefore be detected, and the pitch mark can be set accurately. This results in small spectral distortion and stable operation, and also allows the amount of processing to be reduced.

【００７４】（３）素片の作成に際して、音声信号の
低域濾波波形を用いたので、高周波の細かい変動が除去
されてなだらかな波形になり、極大点の検出が容易にな
る。(3) Since the low-pass filtered waveform of the voice signal is used in the production of the segment, the fine fluctuation of the high frequency is removed to form a smooth waveform, and the maximum point can be easily detected.

【００７５】（４）仮の駆動点ｔｍから極大点検出の
ために遡る時間をｔｐ×ａに制限しているので、遡りす
ぎを防止でき、正確に第１のピークを捕らえることが可
能になる。(4) Because the time traced back from the temporary driving point tm to detect local maxima is limited to tp × a, tracing back too far is prevented and the first peak can be captured accurately.

【００７６】（５）しきい値を最大値ｍａｘの所定数
倍ｍａｘ×ｂに設定したので、小さい余分な極大値を誤
って融合させることがなくなる。このため、雑音成分が
重畳しても、余分な極大値に影響されることなく、安定
して動作させることができる。(5) Since the threshold value is set to a predetermined multiple of the maximum value max, that is, max × b, it is possible to prevent erroneous fusion of small extra maximum values. Therefore, even if a noise component is superimposed, it is possible to operate stably without being affected by an extra maximum value.

【００７７】（６）一定長さを単位として音声素片を
扱い、フレーム処理を行うことで、音声合成時におい
て、音声波形データを制御しやすくなる。(6) By handling a voice segment with a fixed length as a unit and performing frame processing, voice waveform data can be easily controlled during voice synthesis.

【００７８】［第２の実施形態］次に、本発明の第２の
実施形態について説明する。[Second Embodiment] Next, a second embodiment of the present invention will be described.

【００７９】図８は第２の実施形態に係る音声合成装置
の構成を示すブロック図である。FIG. 8 is a block diagram showing the arrangement of a speech synthesizer according to the second embodiment.

【００８０】本実施形態に係る音声合成装置の全体構成
は、前記第１の実施形態に係る音声合成装置と同様であ
る。本実施形態の音声合成方法の特徴は、素片作成部
を、第１の実施形態に係る素片作成部１０７に対して、
図８のフローチャートに示す処理機能にした点にある。The overall configuration of the speech synthesizer according to this embodiment is the same as that of the speech synthesizer according to the first embodiment. The speech synthesis method of this embodiment is characterized in that the segment creating unit has the processing function shown in the flowchart of FIG. 8, in place of that of the segment creating unit 107 according to the first embodiment.

【００８１】この素片作成部での処理を図８に従って説
明する。The processing in this segment creating unit will be described with reference to FIG. 8.

【００８２】本実施形態の音声合成装置では、ステップ
Ｓ２００〜ステップＳ２０３までの処理は前記第１の実
施形態形態に係る素片作成部１０７に機能と同様であ
る。即ち、ステップＳ２００において音声信号データを
分析フレームに分割し、ステップＳ２０１においてＬＰ
Ｆを掛け、ステップＳ２０２においてフレーム番号ｎを
初期化し、ステップＳ２０３においてピッチ周期ｔｐを
検出する。なお、これらの処理における詳細は、前記第
１実施形態における各処理と同様である。In the speech synthesizer of this embodiment, the processing from step S200 to step S203 is the same as in the segment creating unit 107 according to the first embodiment. That is, the speech signal data is divided into analysis frames in step S200, the LPF is applied in step S201, the frame number n is initialized in step S202, and the pitch period tp is detected in step S203. The details of these steps are the same as in the first embodiment.

【００８３】次いで、ステップＳ３０３で、当該分析フ
レームの音韻の種類に応じて所定値（ｃ）を設定する。
この所定値（ｃ）は、規則で生成しても、当該分析フレ
ームの音韻を基にしてテーブルを引く構成でもよい。Next, in step S303, a predetermined value (c) is set according to the phoneme type of the analysis frame. This predetermined value (c) may be generated by rule, or may be looked up in a table based on the phoneme of the analysis frame.

【００８４】ステップＳ３０４では、各分析フレームの
フレーム中央近傍の最大値（ｍａｘ）と、その時間座標
ｔｍを検出する。なお、フレーム中央近傍の決定には、
前記ステップＳ２０３で求めたピッチ周期ｔｐを用い、
前記第１実施形態のステップＳ２０４と同様にして行
う。これにより、フレーム中央近傍の最大値ｍａｘは、ｍａｘ＝ｍａｘｍｕｍ｛ｙn(l)；l＝（Ｆｒ／２）−
０．６ｔｐ,…,（Ｆｒ／２）＋０．６ｔｐ｝＝ｙn(tm) となる。この最大値ｍａｘの時間座標ｔｍを仮の駆動点
として設定する。In step S304, the maximum value (max) near the center of each analysis frame and its time coordinate tm are detected. The vicinity of the frame center is determined using the pitch period tp obtained in step S203, in the same way as in step S204 of the first embodiment. The maximum value max near the frame center is therefore max = maximum{yn(l); l = (Fr/2) - 0.6tp, ..., (Fr/2) + 0.6tp} = yn(tm). The time coordinate tm of this maximum value max is set as the temporary driving point.

【００８５】ステップＳ３０５では、仮の駆動点として
設定された時間座標ｔｍから所定値ｃ分だけ遡った時間
座標域で全ての極大値を検出する。遡るの区間は［ｔｍ
−ｃ，ｔｍ）である。In step S305, all local maxima are detected in the time coordinate range extending back by the predetermined value c from the time coordinate tm set as the temporary driving point. The range searched is [tm - c, tm).
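The table-lookup variant of step S303 combined with the range of step S305 can be sketched as follows. The table contents are placeholders invented purely for illustration: the source gives no values for c and no list of phoneme classes, and the names `LOOKBACK_MS` and `search_interval` are hypothetical.

```python
# Hypothetical lookback table: phoneme class -> c in milliseconds.
# These numbers are illustrative placeholders, NOT values from the patent.
LOOKBACK_MS = {"vowel": 3.0, "nasal": 2.5, "default": 2.0}


def search_interval(tm, phoneme_class, fs):
    """Return (start, end) of the half-open search range [tm - c, tm) in samples."""
    c_ms = LOOKBACK_MS.get(phoneme_class, LOOKBACK_MS["default"])
    c = int(round(c_ms * fs / 1000))           # convert c to samples
    return max(0, tm - c), tm                  # clamp at the frame start
```

Since c comes from a lookup rather than from the per-frame product tp × a, the range is fixed once the phoneme class is known.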

【００８６】極大値をＰｋ、検出された極大値Ｐｋの総
数をＭ、極大値Ｐｋの時間座標をｔｐｋ（ｋ＝１,２,
…,Ｍ）として、Ｐｋ＝ｙn（ｔｐｋ）とする。The local maxima are denoted Pk, the total number of detected local maxima is M, and the time coordinate of Pk is tpk (k = 1, 2, ..., M), so that Pk = yn(tpk).

【００８７】次に、ステップＳ３０６〜３１１におい
て、ピーク融合処理を行う。このピーク融合処理も前記
第１実施形態と同様である。即ち、ステップＳ３０６に
おいてピッチマークＫの初期値として時間座標ｔｍを設
定し、ステップＳ３０７において最大値カウンタｊをＭ
にセットし、ステップＳ３０８においてしきい値ｍａｘ
×ｂと極大値Ｐｊとの大小を判定し、ステップＳ３０９
においてピッチマークＫを極大値Ｐｊの時間座標ｔｐｋ
で置き換え、ステップＳ３１０において最大値カウンタ
ｊを１だけ減算し、ステップＳ３１１において最大値カ
ウンタｊの値が０でないかを判定する。そして、最大値
カウンタｊの値が０になるまでステップＳ３０８〜ステ
ップＳ３１１の処理を繰り返す。Next, in steps S306 to S311, peak fusion processing is performed in the same way as in the first embodiment. That is, the time coordinate tm is set as the initial value of the pitch mark K in step S306, the maximum value counter j is set to M in step S307, the local maximum Pj is compared with the threshold max × b in step S308, the pitch mark K is replaced with the time coordinate tpk of the local maximum Pj in step S309, the maximum value counter j is decremented by 1 in step S310, and step S311 determines whether the counter j is nonzero. Steps S308 to S311 are repeated until the counter j reaches 0.

【００８８】次いで、ステップＳ３１２において音声デ
ータXnを切り出してピッチマークＫを中央にセンタリン
グし、ステップＳ３１３において切り出した音声データ
を素片辞書１０５として順次書き込み、ステップＳ３１
４において全分析フレームについて素片の書き込みが終
了したか否かの判定を行う。この書き込みが終了してい
なければ、ステップＳ３１５でフレーム番号を更新して
ステップＳ２０３に戻り、ステップＳ２０３からステッ
プＳ３１３までの処理を継続する。ステップＳ３１４で
全分析フレームの処理が終了したと判定した場合は、素
片辞書１０５のデータディスクのクローズ処理等（図示
せず）を行って素片作成部１０７の動作を終了する。Next, in step S312, the speech data Xn is cut out and centered on the pitch mark K; in step S313, the cut-out speech data is written sequentially to the segment dictionary 105; and step S314 determines whether segments have been written for all analysis frames. If not, the frame number is updated in step S315, the process returns to step S203, and the processing from step S203 to step S313 continues. If step S314 determines that all analysis frames have been processed, closing processing for the data disk of the segment dictionary 105 and the like (not shown) is performed, and the operation of the segment creating unit 107 ends.

【００８９】以上の処理によって作成された素片が書き
込まれた素片辞書１０５内から、対象となる素片が適宜
選択され、窓掛け部１０６にて窓掛けが行われて、音声
合成部１０４で音声合成処理が行われる。The target segments are selected as appropriate from the segment dictionary 105, into which the segments created by the above processing have been written, windowed by the windowing unit 106, and passed to the speech synthesis unit 104 for speech synthesis.

【００９０】［効果］以上の構成により、前記第１実施
形態に音声合成方法及び音声合成装置と同様の作用、効
果を奏することができる。[Effect] With the above configuration, the same operation and effect as the voice synthesizing method and the voice synthesizing apparatus of the first embodiment can be obtained.

【００９１】さらに、本実施形態では、分析フレームの
音韻の種類に応じて所定値（ｃ）を設定し、仮の駆動点
から遡って極大値を検出する範囲をこの所定値（ｃ）の
範囲に限定するようにしたので、次のような効果を奏す
る。Further, in the present embodiment, the predetermined value (c) is set according to the type of phoneme of the analysis frame, and the range in which the maximum value is detected retroactively from the temporary driving point is the range of the predetermined value (c). Since it is limited to, the following effects can be obtained.

【００９２】（１）音韻の種類に応じて先見的に所定
値（ｃ）を設定することができ、処理するデータに応じ
た適切な範囲の制御が可能になる。(1) The predetermined value (c) can be set a priori according to the type of phoneme, and it becomes possible to control within a proper range according to the data to be processed.

【００９３】（２）遡る範囲を所定値（ｃ）としたの
で、第１実施形態において各分析フレーム毎に行ってい
た「ｔｐ×ａ」の処理が不要となる。これにより、処理
量を減少させることができ、処理装置の簡素化、又は処
理の高速化を図ることができる。(2) Since the range to be traced back is set to the predetermined value (c), the “tp × a” processing which is performed for each analysis frame in the first embodiment is not necessary. As a result, the processing amount can be reduced, and the processing device can be simplified or the processing speed can be increased.

【００９４】［変形例］なお、前記第１,２の実施形態
では、ＬＰＦ処理を一括して行うようにしたが、各フレ
ーム毎に行うようにしてもよい。[Modification] In the first and second embodiments, the LPF processing is performed collectively, but it may be performed for each frame.

【００９５】ピッチ周期検出法としてケプストラム法を
用いたが、他の方法、例えば自己相関法や、線形予測残
差の自己相関である変形自己相関法などの他の方法を用
いるてもよい。Although the cepstrum method is used as the pitch period detection method, another method such as an autocorrelation method or a modified autocorrelation method which is an autocorrelation of linear prediction residuals may be used.

【００９６】また、前記各実施形態の音声合成方法およ
び音声合成装置における素片作成部は、原音声のピッチ
を変化させ、声の高さを変更する、いわゆる音声ピッチ
変換装置でのピッチマーク設定等の、種々の音声出力装
置における処理に適応することが可能である。The segment creating unit in the speech synthesis method and speech synthesizer of each of the above embodiments can also be applied to processing in various audio output devices, such as pitch mark setting in a so-called voice pitch conversion apparatus, which changes the pitch of the original speech to alter the height of the voice.

【００９７】[0097]

【発明の効果】以上、詳細に説明したように、本発明に
よれば、次のような効果を奏することができる。As described in detail above, according to the present invention, the following effects can be obtained.

【００９８】（１）音声信号の１ピッチ波形の最初の
極大点を、切り出す音声信号の中心点にしているので、
波形重畳する際の中心点を容易に設定することができ
る。これにより、ピッチの揺れが少なくスペクトル歪み
も小さくすることができる。この結果、聴感上ゴロゴロ
した音を減少させることができる。(1) The first local maximum point of the one-pitch waveform of the speech signal is used as the center point of the speech signal to be cut out, so the center point for waveform superposition can be set easily. This reduces pitch fluctuation and also reduces spectral distortion. As a result, audible rumbling sounds can be reduced.

【００９９】（２）仮の駆動点を定め、その点から遡
った時間領域で、この駆動点との比較において実際の処
理を行う駆動点を特定するので、声門の駆動インパルス
の影響としての最初の極大点を検出することができ、正
確なピッチマークの設定が可能になる。これにより、ス
ペクトル歪みが小さく、安定して動作すると共に、処理
量の減少等を図ることができる。(2) A temporary driving point is set, and the driving point actually used for processing is identified by comparison with this temporary driving point in the time region traced back from it. The first local maximum point, which reflects the glottal driving impulse, can therefore be detected, and the pitch mark can be set accurately. This results in small spectral distortion and stable operation, and also allows the amount of processing to be reduced.

【０１００】（３）素片の作成に際して、音声信号の
低域濾波波形を用いたので、高周波の細かい変動が除去
されてなだらかな波形になり、極大点の検出が容易にな
る。これにより、重畳する際の中心点を容易に設定する
ことができる。(3) Since the low-pass filtered waveform of the audio signal is used in the production of the segment, the fine fluctuation of the high frequency is removed to form a smooth waveform, and the maximum point can be easily detected. This makes it possible to easily set the center point when superimposing.

【０１０１】（４）仮の駆動点から極大点検出のため
に遡る時間を制限したので、遡りすぎを防止でき、正確
に第１のピークを捕らえることが可能になる。(4) Since the time for tracing back from the tentative driving point for detecting the maximum point is limited, it is possible to prevent the tracing back too much and accurately capture the first peak.

【０１０２】（５）しきい値を最大値の所定数倍に設
定したので、小さい余分な極大値を誤って融合したり、
雑音の影響を受けたりすることがなくなり、安定して動
作させることができるようになる。(5) Because the threshold is set to a predetermined multiple of the maximum value, small extraneous local maxima are not fused by mistake and the processing is not affected by noise, so stable operation is achieved.

【０１０３】（６）一定長さを単位として音声素片を
扱い、フレーム処理を行うことで、音声合成時におい
て、音声波形データを制御しやすくなる。(6) By handling a voice segment with a fixed length as a unit and performing frame processing, it becomes easy to control voice waveform data during voice synthesis.

[Brief description of drawings]

【図１】本発明の第１の実施形態に係る音声合成装置の
素片作成部での処理機能を示すフローチャートである。FIG. 1 is a flowchart showing a processing function in a segment creating unit of a speech synthesizer according to a first embodiment of the present invention.

【図２】ピッチを変更しながら音声波形を重畳するピッ
チ同期波形重畳法を示す模式図である。FIG. 2 is a schematic diagram showing a pitch synchronization waveform superimposing method for superimposing a voice waveform while changing a pitch.

【図３】／ｈｅ／と発音した／ｅ／の部分の音声波形に
対するピッチマーク設定例を示す模式図である。FIG. 3 is a schematic diagram showing an example of pitch mark setting for a voice waveform of a portion of / e / pronounced / he /.

【図４】／ｍｅ／と発音した／ｅ／の部分の音声波形に
対するピッチマーク設定例を示す模式図である。FIG. 4 is a schematic diagram showing a pitch mark setting example for a voice waveform of a portion of / e / pronounced / me /.

【図５】本発明の第１の実施形態に係る音声合成装置の
構成を示すブロック図である。FIG. 5 is a block diagram showing a configuration of a speech synthesizer according to the first embodiment of the present invention.

【図６】ケプストラム法を説明するフローチャートであ
る。FIG. 6 is a flowchart illustrating a cepstrum method.

【図７】素片作成部でのピーク融合処理の模式図を示
す。FIG. 7 shows a schematic diagram of peak fusion processing in a segment creating unit.

【図８】本発明の第２の実施形態に係る音声合成装置の
素片作成部での処理機能を示すフローチャートである。FIG. 8 is a flowchart showing a processing function in a segment creating unit of the speech synthesis device according to the second embodiment of the present invention.

[Explanation of symbols]

１０１：テキスト解析部、１０２：単語辞書、１０３：
パラメータ生成部、１４：音声合成部、１０５：素片辞
書、１０６：窓掛け部、１０７：素片作成部、１０８：
音声信号入力部。101: text analysis unit, 102: word dictionary, 103: parameter generation unit, 104: speech synthesis unit, 105: segment dictionary, 106: windowing unit, 107: segment creating unit, 108: speech signal input unit.

Claims

(57) [Claims]

1. A speech synthesis method characterized in that a speech synthesis segment is created in advance by a step of detecting the first local maximum point of a one-pitch waveform of a speech signal and a step of cutting out the speech signal centered on the first local maximum point, and in that windowed superposition is performed while shifting by the pitch period, with the local maximum point in the speech synthesis segment as the center of superposition.

2. The voice synthesizing method according to claim 1, wherein a low-pass filtered waveform of the voice signal is used when the first maximum point is detected.

3. A speech synthesis method characterized in that a speech synthesis segment is created in advance by: a step of dividing a speech signal into analysis frames; a step of low-pass filtering the speech signal; a step of detecting the maximum value near the frame center in the analysis frame and setting the time coordinate of that maximum value as a temporary driving point; a step of detecting local maxima in a time coordinate range traced back from the time coordinate of the temporary driving point; a step of individually comparing all local maxima detected in the traced-back time coordinate range with a threshold; and a step of, when the comparison with the threshold finds a local maximum to be larger, replacing the driving point with the time coordinate of that local maximum and cutting out the speech signal centered on the driving point; and in that windowed superposition is performed while shifting by the pitch period, with the driving point in the speech synthesis segment as the center of superposition.

4. The speech synthesis method according to claim 3, wherein the pitch period in the analysis frame is detected, and the local maxima are detected in a time coordinate range traced back from the temporary driving point by a predetermined multiple of the pitch period.

5. The speech synthesis method according to claim 3, wherein the threshold is set to a predetermined multiple of the maximum value detected as the temporary driving point near the frame center in the analysis frame.

6. The speech synthesis method according to claim 3, wherein a predetermined value is set according to the phoneme type of the speech signal, and the local maxima are detected in a time coordinate range traced back from the temporary driving point by the predetermined value.

7. A speech synthesizer comprising: local maximum point detecting means for detecting the first local maximum point of a one-pitch waveform of a speech signal; speech signal cutting means for cutting out the speech signal centered on the first local maximum point; speech synthesis segment storage means for storing the speech synthesis segment cut out by the speech signal cutting means; and speech synthesis means for performing windowed superposition while shifting by the pitch period, with the local maximum point in the speech synthesis segment stored in the speech synthesis segment storage means as the center of superposition.

8. The speech synthesizer according to claim 7, wherein a low-pass filtered waveform of the speech signal is used when the first local maximum point is detected.

9. A speech synthesizer comprising: dividing means for dividing a speech signal into analysis frames; low-pass filtering means for low-pass filtering the speech signal; maximum value detecting means for detecting the maximum value near the frame center in the analysis frame and setting the time coordinate of that maximum value as a temporary driving point; local maximum detecting means for detecting local maxima in a time coordinate range traced back from the time coordinate of the temporary driving point; comparison means for individually comparing all local maxima detected by the local maximum detecting means with a threshold; speech signal cutting means for, when the comparison by the comparison means finds a local maximum to be larger, replacing the driving point with the time coordinate of that local maximum and cutting out the speech signal centered on the driving point; speech synthesis segment storage means for storing the speech synthesis segment cut out by the speech signal cutting means; and speech synthesis means for performing windowed superposition while shifting by the pitch period, with the driving point in the speech synthesis segment stored in the speech synthesis segment storage means as the center of superposition.

10. The speech synthesizer according to claim 9, wherein the pitch period in the analysis frame is detected, and the local maxima are detected in a time coordinate range traced back from the temporary driving point by a predetermined multiple of the pitch period.

11. The speech synthesizer according to claim 9, wherein the threshold is set to a predetermined multiple of the maximum value detected as the temporary driving point near the frame center in the analysis frame.

12. The speech synthesizer according to claim 9, wherein a predetermined value is set according to the phoneme type of the speech signal, and the local maxima are detected in a time coordinate range traced back from the temporary driving point by the predetermined value.