JPH0990987A

JPH0990987A - Method and device for voice synthesis

Info

Publication number: JPH0990987A
Application number: JP7247716A
Authority: JP
Inventors: Yoshinori Shiga; 芳則志賀
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1995-09-26
Filing date: 1995-09-26
Publication date: 1997-04-04

Abstract

PROBLEM TO BE SOLVED: To generate smooth pitch patterns without setting unnecessary point pitches. SOLUTION: A language analysis processing section 11 performs morphological analysis on an input sentence of mixed kanji and kana and generates reading information and accent information. In a speech synthesis section 2, a phoneme duration calculation processing section 21 determines the duration of each phoneme from the reading information. A per-phoneme point pitch setting processing section 22 uses the accent information to choose one or two of the time points that divide each phoneme's duration into four equal parts, and sets the pitch height (a point pitch) at each chosen point. The time position and frequency of each point pitch are determined by the accent type, the position of the phoneme on which the pitch is set, and the phoneme type or phoneme environment. A speech symbol string generation processing section 23 writes the resulting phoneme information, phoneme durations, and point pitch positions to a file 31 as a speech symbol string. The pitch pattern of the speech to be synthesized is then generated from the phoneme durations and point pitch positions in the speech symbol string, phoneme parameters are generated from the phoneme information, and the speech is finally synthesized.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech synthesis method and apparatus suitable for obtaining smooth intonation close to that of human speech.

[0002]

[Prior Art] As a speech synthesizer of this kind, a rule-based speech synthesizer is known that performs text-to-speech (TTS) processing: it converts text (a sentence) into a symbol string (speech symbol string) consisting of phonemes and prosody, and generates speech from that symbol string. The text-to-speech processing in such a device is broadly divided into a language processing unit and a speech synthesis unit and, taking rule-based synthesis of Japanese as an example, has generally been carried out as follows.

[0003] First, the language processing unit applies linguistic processing such as morphological analysis and syntactic analysis to the input text (a sentence of mixed kanji and kana), decomposing it into morphemes and estimating dependency relations, and at the same time assigns a reading and an accent type to each morpheme. The language processing unit then applies accent shift rules, such as those for compound words, to determine the accent type of each phrase that forms a unit of delivery when read aloud (hereinafter, an accent phrase). The language processing unit of a rule-based speech synthesizer can usually output the reading and accent type obtained for each accent phrase as a symbol string (speech symbol string).

[0004] Next, the speech synthesis unit determines the duration of each phoneme in the obtained reading by predetermined rules, based on factors such as the phoneme's environment. Then, following the reading and the phoneme durations obtained in this way, it reads speech units one after another from a speech unit file prepared in advance, that is, a file storing speech feature parameters in a predetermined synthesis unit such as consonant + vowel (hereinafter, CV), and concatenates them to generate the feature parameter sequence of the speech to be synthesized.

[0005] Furthermore, based on the accent type, the speech synthesis unit sets a point pitch at each time at which the pitch rises or falls, and generates the accent component of the pitch by linear interpolation between the point pitches thus set. It superimposes an intonation component (usually a monotonically decreasing straight line on the logarithmic frequency axis) on this accent component to generate the pitch pattern. The speech synthesis unit then uses, as the sound source signal, periodic pulses obtained from the prosodic parameters in voiced sections and random noise in unvoiced sections, calculates filter coefficients from the speech feature parameter sequence, and supplies these to the synthesis filter to synthesize the desired speech.
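The conventional pitch pattern generation described in paragraph [0005] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the choice of natural-log frequency offsets for the accent component, and the 140 Hz and 100 Hz endpoints of the declination line are all assumptions made for the example.

```python
import math


def interpolate_accent(points, t):
    """Linearly interpolate the accent component at time t (ms).

    points: list of (time_ms, log_offset) pairs sorted by strictly
    increasing time (hypothetical representation of point pitches).
    Outside the covered range the nearest point's value is held.
    """
    if t <= points[0][0]:
        return points[0][1]
    if t >= points[-1][0]:
        return points[-1][1]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("points must be sorted by time")


def pitch_pattern(points, total_ms, start_hz=140.0, end_hz=100.0, step_ms=10):
    """Return [(time_ms, f0_hz)], superimposing both components in the log domain."""
    log_start, log_end = math.log(start_hz), math.log(end_hz)
    pattern = []
    for t in range(0, total_ms + 1, step_ms):
        # Intonation component: a straight, decreasing line on the log-frequency axis.
        intonation = log_start + (log_end - log_start) * t / total_ms
        # Accent component: linear interpolation between the point pitches.
        accent = interpolate_accent(points, t)
        pattern.append((t, math.exp(intonation + accent)))
    return pattern


# Example: one rise and one fall over a 200 ms stretch.
example_points = [(0, 0.0), (100, 0.2), (200, 0.0)]
example_pattern = pitch_pattern(example_points, 200)
```

Working in the log domain makes "superimposing" the two components a simple addition, which is one common reading of the description; the source does not state the arithmetic explicitly.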

[0006] As noted above, the language processing unit can output the reading and accent type of each accent phrase as a symbol string. If the language analysis result is incorrect, the output symbol string can therefore be edited before being input to the speech synthesis unit for synthesis.

[0007]

[Problems to Be Solved by the Invention] Speech synthesis techniques of the kind described above have existed for some time, but the pitch patterns generated by speech synthesizers using them have the following problems.

[0008] In conventional speech synthesis, when the accent component of the pitch is generated from the accent type, only one point pitch is set per syllable, as shown in FIG. 16. As a result, fine pitch control is impossible and a smooth pitch pattern cannot be obtained.

[0009] One way to improve on this is to increase the number of point pitches set per syllable, as in the invention described in Japanese Unexamined Patent Publication No. H3-164800. On its own, however, this often forces unnecessary point pitches to be set.

[0010] Another conceivable method enables fine pitch control by specifying each point pitch position in the form "set it so many milliseconds (ms) after the start of the n-th syllable from the beginning". With this method, however, whenever the accent type is the same, the point pitch is always set at the same time offset in the same-numbered phoneme from the beginning. A smooth pitch pattern therefore cannot be obtained, and the intonation of the synthesized speech may frequently sound unnatural.

[0011] The prior art also has the following problem. A conventional speech synthesizer can output the reading and accent type of the speech to be synthesized as a symbol string (speech symbol string) through language processing, and then synthesize speech by inputting that string to the speech synthesis processing. However, the only parameter the symbol string carries for giving intonation to the synthesized speech is the accent type. Fine control of pitch pattern generation is thus left to the speech synthesis processing, and the user cannot create a smooth pitch pattern of their own.

[0012] The present invention has been made in view of these circumstances. Its object is to provide a speech synthesis method and apparatus that enable fine pitch control and generate smooth pitch patterns without setting unnecessary point pitches.

[0013] Another object of the present invention is to provide a speech synthesis method and apparatus in which the user can control the pitch pattern by modifying the speech symbol string, in which this control is simple, and in which a smooth pitch pattern can be generated from only a small number of specifications.

[0014] Still another object of the present invention is to provide a symbol string editing apparatus with which even a user unfamiliar with speech synthesis technology can easily edit a speech symbol string and create a pitch pattern indistinguishable from that of human speech.

[0015]

[Means for Solving the Problems] To solve the above problems, the speech synthesis method according to a first aspect of the present invention assigns a plurality of point pitches based on accent information of a word or phrase and synthesizes speech based on a pitch pattern generated by interpolating between those point pitches. It is characterized in that the position on the time axis at which each point pitch is assigned is determined based on at least one of the phoneme environment and the phoneme type of the phoneme to which the point pitch is to be assigned.

[0016] The speech synthesis apparatus according to a second aspect of the present invention comprises: a speech unit storage medium that stores speech feature parameters as speech units of a predetermined synthesis unit; feature parameter generation processing means that reads speech units from the speech unit storage medium based on input phoneme information and concatenates them to generate the feature parameters of the speech to be synthesized; per-phoneme point pitch setting processing means that sets point pitches for each phoneme using accent information input in correspondence with the phoneme information, and that determines the position of each point pitch on the time axis based on at least one of the phoneme environment and the phoneme type of the phoneme to which the point pitch belongs; pitch pattern generation processing means that generates the pitch pattern of the speech by interpolating between adjacent point pitches set by the per-phoneme point pitch setting processing means; and synthesis filter processing means that synthesizes speech from the feature parameters generated by the feature parameter generation processing means and the pitch pattern generated by the pitch pattern generation processing means.

[0017] The speech synthesis method according to a third aspect of the present invention differs from that of the first aspect in that the position on the time axis at which each point pitch is assigned is determined based on at least one of the syllable environment and the syllable type of the syllable, rather than the phoneme, to which the point pitch is to be assigned.

[0018] The speech synthesis apparatus according to a fourth aspect of the present invention replaces the per-phoneme point pitch setting processing means of the second aspect with per-syllable point pitch setting processing means. This means sets point pitches for each syllable using accent information input in correspondence with the phoneme information, and determines the position of each point pitch on the time axis based on at least one of the syllable environment and the syllable type of the syllable to which the point pitch belongs. The pitch pattern of the speech is generated by interpolating between adjacent point pitches set by the per-syllable point pitch setting processing means.

[0019] The speech synthesis method according to a fifth aspect and the speech synthesis apparatus according to a sixth aspect of the present invention are characterized in that, to set point pitches for each phoneme, at least four time points at which a point pitch can be specified are defined within the duration of one phoneme, and point pitches are specified at no more than two of them.
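The "four candidate time points, at most two used" scheme of paragraph [0019] can be illustrated with a short sketch. The abstract says the phoneme duration is divided into four equal parts; the exact placement of the candidates (here, at the start of each quarter) and the function names are assumptions made for the example.

```python
def candidate_times(start_ms, duration_ms, n_candidates=4):
    """Return the candidate time points (ms) inside one phoneme.

    Assumption: candidates sit at the start of each of n_candidates
    equal divisions of the phoneme's duration.
    """
    return [start_ms + duration_ms * k / n_candidates for k in range(n_candidates)]


def assign_point_pitches(start_ms, duration_ms, choices):
    """Place point pitches on at most two of the four candidates.

    choices: dict mapping candidate index (0..3) to a pitch in Hz.
    Returns a time-sorted list of (time_ms, pitch_hz).
    """
    if len(choices) > 2:
        raise ValueError("at most two point pitches per phoneme")
    times = candidate_times(start_ms, duration_ms)
    for index in choices:
        if not 0 <= index < len(times):
            raise ValueError(f"candidate index out of range: {index}")
    return sorted((times[index], hz) for index, hz in choices.items())


# Example: an 80 ms phoneme starting at 100 ms, using candidates 1 and 3.
slots = candidate_times(100, 80)
placed = assign_point_pitches(100, 80, {1: 150.0, 3: 130.0})
```

Restricting each phoneme to a small index (0 to 3) rather than a free millisecond value is what keeps the specification compact while still allowing sub-phoneme timing.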

[0020] The speech synthesis method according to a seventh aspect and the speech synthesis apparatus according to an eighth aspect of the present invention differ from those of the fifth and sixth aspects in that point pitches are set for each syllable rather than for each phoneme: at least four time points at which a point pitch can be specified are defined within the duration of one syllable, and point pitches are specified at no more than two of them.

[0021] The speech synthesis apparatus according to a ninth aspect of the present invention adds, to the speech synthesis apparatus according to the second, fourth, sixth, or eighth aspect, language analysis processing means that analyzes the text to be synthesized and generates phoneme information and accent information, thereby providing a text-to-speech (TTS) conversion function.

[0022] The speech synthesis method according to a tenth aspect of the present invention takes as input a speech symbol string describing the phoneme information of the speech and prosodic information that includes pitch information. From a group of speech units prepared in advance, each consisting of a plurality of speech feature parameters, it selects and concatenates speech units according to the phoneme information in the speech symbol string to generate parameters representing the phonemes of the speech. It also generates the pitch pattern of the speech according to the prosodic information in the speech symbol string, and synthesizes speech from those parameters and the pitch pattern. The method is characterized in that the pitch information in the speech symbol string is given as point pitches.

[0023] The speech synthesis apparatus according to an eleventh aspect of the present invention comprises: a speech unit storage medium that stores a group of speech units prepared in advance, each consisting of a plurality of speech feature parameters; phoneme parameter generation means that receives at least the phoneme information from a speech symbol string describing the phoneme information of the speech to be synthesized and prosodic information including pitch information given as point pitches, and that selects and concatenates speech units from the speech unit storage medium according to that phoneme information to generate parameters representing the phonemes of the speech; prosodic parameter generation means that receives at least the prosodic information from the speech symbol string and generates the pitch pattern of the speech to be synthesized according to that prosodic information; and synthesis filter processing means that synthesizes speech from the parameters generated by the phoneme parameter generation means and the pitch pattern generated by the prosodic parameter generation means.

[0024] In the speech synthesis method according to a twelfth aspect and the speech synthesis apparatus according to a thirteenth aspect of the present invention, the pitch information in the speech symbol string, unlike that used in the tenth and eleventh aspects, is not merely given as point pitches: the position of each point pitch on the time axis is specified relative to the start point of each phoneme.
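To illustrate what "specified relative to the start point of each phoneme" implies for a consumer of the speech symbol string, the sketch below converts phoneme-relative point pitches into absolute times. The `PhonemeEntry` record and its fields are hypothetical; this section of the source does not show the concrete syntax of the speech symbol string.

```python
from dataclasses import dataclass, field


@dataclass
class PhonemeEntry:
    """One phoneme of a (hypothetical) parsed speech symbol string."""
    symbol: str
    duration_ms: float
    # Each item is (offset from this phoneme's start in ms, pitch in Hz).
    point_pitches: list = field(default_factory=list)


def to_absolute(entries):
    """Convert phoneme-relative point pitches to absolute (time_ms, pitch_hz)."""
    absolute = []
    phoneme_start = 0.0
    for entry in entries:
        for offset, hz in entry.point_pitches:
            if not 0 <= offset <= entry.duration_ms:
                raise ValueError(f"offset {offset} outside phoneme {entry.symbol!r}")
            absolute.append((phoneme_start + offset, hz))
        # The next phoneme starts where this one ends.
        phoneme_start += entry.duration_ms
    return absolute


# Example: three phonemes, the middle one carrying no point pitch.
example_entries = [
    PhonemeEntry("a", 80, [(20, 120.0)]),
    PhonemeEntry("m", 60),
    PhonemeEntry("e", 100, [(0, 150.0), (75, 130.0)]),
]
example_absolute = to_absolute(example_entries)
```

Because offsets are local to each phoneme, changing one phoneme's duration shifts later point pitches automatically without editing their entries.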

[0025] In the speech synthesis method according to a fourteenth aspect and the speech synthesis apparatus according to a fifteenth aspect of the present invention, the pitch information in the speech symbol string is likewise not merely given as point pitches. Unlike the pitch information used in the twelfth and thirteenth aspects, however, the position of each point pitch on the time axis is specified relative to the start point of each syllable rather than of each phoneme.

[0026] In the speech synthesis method according to a sixteenth aspect and the speech synthesis apparatus according to a seventeenth aspect of the present invention, the prosodic information in the speech symbol string is described by defining at least four time points at which a point pitch can be specified within the duration of each phoneme included in the phoneme information of the speech symbol string, and specifying point pitches at no more than two of them.

[0027] In the speech synthesis method according to an eighteenth aspect and the speech synthesis apparatus according to a nineteenth aspect of the present invention, the prosodic information in the speech symbol string is described by defining at least four time points at which a point pitch can be specified within the duration of each phoneme included in the phoneme information of the speech symbol string, and specifying, for no more than two of them, the position of the point pitch on the time axis relative to the start point of the phoneme.

[0028] In the speech synthesis method according to a twentieth aspect and the speech synthesis apparatus according to a twenty-first aspect of the present invention, the prosodic information in the speech symbol string is described by defining at least four time points at which a point pitch can be specified within the duration of each syllable included in the phoneme information of the speech symbol string, and specifying point pitches at no more than two of them.

[0029] In the speech synthesis method according to a twenty-second aspect and the speech synthesis apparatus according to a twenty-third aspect of the present invention, the prosodic information in the speech symbol string is described by defining at least four time points at which a point pitch can be specified within the duration of each syllable included in the phoneme information of the speech symbol string, and specifying, for no more than two of them, the position of the point pitch on the time axis relative to the start point of the syllable.

[0030] The speech symbol string editing apparatus according to a twenty-fourth aspect of the present invention is applied to a speech synthesis apparatus that takes as input a speech symbol string describing the phoneme information of the speech and prosodic information including pitch information given as point pitches. That speech synthesis apparatus selects and concatenates, from a group of speech units prepared in advance and each consisting of a plurality of speech feature parameters, speech units according to the phoneme information in the speech symbol string to generate parameters representing the phonemes of the speech; generates the pitch pattern of the speech according to the prosodic information in the speech symbol string; and synthesizes speech from those parameters and the pitch pattern. The editing apparatus comprises: display means for displaying a plurality of point pitches according to the pitch information in the speech symbol string; input means for instructing a change to a displayed point pitch; and speech symbol string correction processing means that changes the corresponding displayed point pitch in response to a change instruction from the input means and reflects the result in the speech symbol string.

[0031] The speech symbol string editing apparatus according to a twenty-fifth aspect of the present invention replaces the display means, input means, and speech symbol string correction processing means of the twenty-fourth aspect with: display means for displaying a plurality of point pitches as a graph according to the pitch information in the speech symbol string; input means for entering an instruction to move the position of a displayed point pitch; and speech symbol string correction processing means that moves the corresponding displayed point pitch on the screen in response to a move instruction from the input means and reflects the new position of the point pitch in the speech symbol string.

[0032] The speech symbol string editing apparatus according to a twenty-sixth aspect of the present invention replaces the display means of the twenty-fifth aspect with display means that displays as a graph, according to the pitch information in the speech symbol string, a plurality of point pitches together with the pitch contour obtained by interpolating between them.

[0033] In the speech symbol string editing apparatus according to a twenty-seventh aspect of the present invention, the display means of the twenty-fifth or twenty-sixth aspect additionally displays as a graph, as a reference pattern, the pitch contour obtained by analyzing reference speech corresponding to the phoneme information in the speech symbol string.

[0034] With the above configurations, the inventions according to the first to fourth aspects can generate, from phoneme information and accent type, a pitch pattern that is smoother and closer to the intonation of the human voice than before.

[0035] The inventions according to the fifth to eighth aspects enable very fine pitch control while keeping the number of point pitches to be set to a minimum, and can thus synthesize speech closer to the intonation of the human voice.

[0036] The invention according to the ninth aspect provides a text-to-speech (TTS) function that synthesizes speech from text while achieving the same effects as the inventions according to the second, fourth, sixth, or eighth aspect.

[0037] The inventions according to the tenth to twenty-eighth aspects allow the user to supply a pitch pattern of their own choosing to speech synthesis easily and with minimal effort.

[0038]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. [First Embodiment] FIG. 1 is a block diagram showing the schematic configuration of a speech synthesis apparatus according to a first embodiment of the present invention.

[0039] This speech synthesis apparatus is realized by running dedicated software (text-to-speech conversion software) on an information processing device such as a personal computer. It has a text-to-speech (TTS) conversion function, that is, a function that generates speech from text, and its functional configuration is broadly divided into a language processing unit 1, a speech synthesis unit 2, and a speech symbol string editing unit 3.

[0040] The language processing unit 1 analyzes an input sentence, for example a sentence of mixed kanji and kana, and generates reading information and accent information. Based on the sentence analysis result from the language processing unit 1, the speech synthesis unit 2 generates a speech symbol string describing the phoneme information, the duration of each phoneme, and the point pitch positions for each phoneme, and generates speech from that speech symbol string. The speech symbol string editing unit 3 edits, in response to user operations, a speech symbol string generated by the speech synthesis unit 2 or created by the user.

[0041] In the speech synthesis apparatus of FIG. 1, the document to be converted to speech (read aloud), here a Japanese document, is stored as a text file (not shown). Following the text-to-speech conversion software, the apparatus reads mixed kanji and kana sentences from the file one at a time, and the language processing unit 1 and the speech synthesis unit 2 perform the text-to-speech conversion described below to synthesize speech.

【００４２】First, a mixed kanji and kana sentence read from the text file is input to the language analysis processing unit 11 in the language processing unit 1. The language analysis processing unit 11 performs morphological analysis on the input sentence and generates reading information and accent information. Morphological analysis is the task of determining which character strings in a given sentence constitute words and what structure those words have.

【００４３】To do so, the language analysis processing unit 11 uses a morpheme dictionary 12, whose entries are morphemes (the smallest constituent elements of a sentence), and a connection rule file 13, in which the rules for connecting morphemes are registered. Specifically, the language analysis processing unit 11 finds every candidate morpheme sequence obtained by matching the input sentence against the morpheme dictionary 12 (exhaustive search) and, referring to the connection rule file 13, outputs from among them the combinations whose adjacent morphemes can be connected grammatically. The morpheme dictionary 12 registers the reading and accent type of each morpheme together with the grammatical information used during analysis. Therefore, once the morphemes are determined by morphological analysis, their readings and accent types are obtained at the same time.
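
The exhaustive matching followed by connection-rule filtering described above can be sketched as follows. This is only an illustration: the toy dictionary, the part-of-speech tags, the allowed tag pairs, and the function names are all hypothetical and are not taken from the morpheme dictionary 12 or the connection rule file 13.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionary: surface form -> part-of-speech tag.
DICTIONARY: Dict[str, str] = {
    "本": "noun",
    "を": "particle",
    "読み": "verb",
    "読": "noun",
    "み": "noun",
    "ます": "aux",
}

# Hypothetical connection rules: allowed (left tag, right tag) pairs.
ALLOWED: Set[Tuple[str, str]] = {
    ("noun", "particle"),
    ("particle", "verb"),
    ("verb", "aux"),
}

Morpheme = Tuple[str, str]


def enumerate_segmentations(text: str) -> List[List[Morpheme]]:
    """Return every way of covering `text` with dictionary entries."""
    if not text:
        return [[]]
    results: List[List[Morpheme]] = []
    for end in range(1, len(text) + 1):
        surface = text[:end]
        if surface in DICTIONARY:
            for rest in enumerate_segmentations(text[end:]):
                results.append([(surface, DICTIONARY[surface])] + rest)
    return results


def is_grammatical(sequence: List[Morpheme]) -> bool:
    """Check that every adjacent pair of tags is an allowed connection."""
    return all((a[1], b[1]) in ALLOWED for a, b in zip(sequence, sequence[1:]))


def analyze(text: str) -> List[List[Morpheme]]:
    """Exhaustive candidate generation, then connection-rule filtering."""
    return [s for s in enumerate_segmentations(text) if is_grammatical(s)]
```

With this toy data, `analyze("本を読みます")` generates two candidate sequences and keeps only the one whose neighbors all connect. A practical analyzer would also attach readings and accent types to each entry and would avoid enumerating every candidate explicitly.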

【００４４】For example, morphological analysis of the sentence 「公園へ行って本を読みます。」 ("I go to the park and read a book.") divides it into the morphemes /公園/へ/行って/本/を/読み/ます/。 At the same time, each morpheme is given a reading and an accent type, yielding /ko:eN/e/iqte/ho^N/o/yomi/ma^su/. Here the symbol "N" represents the moraic nasal (「ん」), "q" the geminate consonant (「っ」), and ":" a long vowel (「ー」). A morpheme containing "^" has an accent in which the pitch is high on the phoneme immediately before the "^" and falls on the phoneme immediately after it. A morpheme without "^" has a flat-type accent.

【００４５】When people read a sentence aloud, however, they do not accent each morpheme individually in this way. They group several morphemes together and place an accent on each group.

【００４６】Taking this into account, the language analysis processing unit 11 further groups the morphemes into units that each receive one accent (hereinafter called accent phrases) and, at the same time, estimates the accent shift caused by the grouping. In addition, the language analysis processing unit 11 adds information such as vowel devoicing and pauses (breaths) for reading aloud. For the above example, it finally generates the following reading information.

【００４７】/ko:eNe/iqte./ho^No/yomima^s(u)/
Here the period "." represents a breath pause and "( )" encloses a devoiced vowel. When the language analysis processing unit 11 in the language processing unit 1 has generated the reading information as described above, the phoneme duration calculation processing unit 21 in the speech synthesis unit 2 is activated.
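
As a small illustration of this notation, the sketch below splits the reading information into accent phrases and extracts the accent mark "^", the breath mark ".", and devoiced vowels in "( )". The data structure and function name are hypothetical, and treating every character (including ":") as one phoneme symbol is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AccentPhrase:
    phonemes: str                # phoneme symbols with all markers removed
    accent_index: Optional[int]  # number of symbols before "^"; None for flat type
    devoiced: List[int]          # indices into `phonemes` of devoiced vowels
    pause_after: bool            # True if the phrase is followed by a breath "."


def parse_reading(reading: str) -> List[AccentPhrase]:
    """Parse a string such as '/ko:eNe/iqte./ho^No/yomima^s(u)/'."""
    phrases: List[AccentPhrase] = []
    for chunk in reading.split("/"):
        if not chunk:
            continue
        pause = chunk.endswith(".")
        if pause:
            chunk = chunk[:-1]
        symbols: List[str] = []
        accent: Optional[int] = None
        devoiced: List[int] = []
        in_parentheses = False
        for ch in chunk:
            if ch == "^":
                accent = len(symbols)
            elif ch == "(":
                in_parentheses = True
            elif ch == ")":
                in_parentheses = False
            else:
                if in_parentheses:
                    devoiced.append(len(symbols))
                symbols.append(ch)
        phrases.append(AccentPhrase("".join(symbols), accent, devoiced, pause))
    return phrases
```

For the example string, the last phrase parses to the phonemes `yomimasu` with the accent mark after six symbols and the final `u` flagged as devoiced.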

【００４８】The phoneme duration calculation processing unit 21 determines the duration (in ms) of each phoneme in the input sentence according to the reading information generated by the language analysis processing unit 11. This determination uses a very simple algorithm that places the boundaries between consonants (C) and vowels (V), that is, the CV transitions, at equal intervals. FIG. 2 shows a specific example for the accent phrase "yomimasu".
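
The equal-spacing principle can be illustrated with a minimal sketch. The text does not give the actual values of FIG. 2, so the sketch makes its own assumptions: each consonant has a given fixed duration that ends exactly at its CV boundary, each vowel fills the gap up to the next consonant, and the last vowel lasts one full interval. The function name and all numbers are hypothetical.

```python
from typing import List, Tuple


def cv_durations(consonant_ms: List[float], interval_ms: float) -> List[Tuple[float, float]]:
    """Return (consonant duration, vowel duration) per CV syllable.

    consonant_ms[i] is the assumed duration of the consonant of the i-th
    syllable (0 if the syllable has no consonant). Vowel durations are
    chosen so that successive CV boundaries are interval_ms apart.
    """
    result: List[Tuple[float, float]] = []
    for i, consonant in enumerate(consonant_ms):
        next_consonant = consonant_ms[i + 1] if i + 1 < len(consonant_ms) else 0.0
        vowel = interval_ms - next_consonant
        result.append((consonant, vowel))
    return result


def cv_boundaries(durations: List[Tuple[float, float]]) -> List[float]:
    """Return the time of each consonant-to-vowel boundary."""
    boundaries: List[float] = []
    t = 0.0
    for consonant, vowel in durations:
        boundaries.append(t + consonant)
        t += consonant + vowel
    return boundaries
```

For example, with assumed consonant durations of 40, 30, 50 and 60 ms and a 150 ms interval, the CV boundaries fall at 40, 190, 340 and 490 ms, that is, exactly 150 ms apart.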

【００４９】When the phoneme duration calculation processing unit 21 has determined the duration of each phoneme in the input sentence, the phoneme-based point pitch setting processing unit 22 in the same speech synthesis unit 2 is activated. The phoneme-based point pitch setting processing unit 22 sets point pitch positions based on the phoneme durations determined by the phoneme duration calculation processing unit 21 and the accent information determined by the language analysis processing unit 11. Here, the time of each point pitch is chosen as one or two of the time points obtained by dividing the duration of each phoneme into four equal parts. FIG. 3 shows a specific example for the accent phrase "yomimasu", whose phoneme durations were determined as in FIG. 2. In FIG. 3, the horizontal axis represents time, the vertical axis represents pitch (in octaves, oct), and the white circles mark the point pitch positions. A point pitch position (x, y) consists of x, which indicates one of the time points 0 to 3 obtained by dividing the phoneme duration into four equal parts (the final time point of the phrase-final phoneme is 4), and y, which indicates the pitch (in oct).
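
The (x, y) notation maps to absolute time in a straightforward way: a point with index x inside a phoneme lies x quarters of that phoneme's duration after the phoneme's start. The sketch below shows this conversion; the function name and the example durations and pitches are hypothetical, since the values of FIG. 2 and FIG. 3 are not given in the text.

```python
from typing import List, Tuple


def point_pitch_times(durations_ms: List[float],
                      points: List[List[Tuple[int, float]]]) -> List[Tuple[float, float]]:
    """Convert per-phoneme point pitches (x, y) into (time in ms, pitch in oct).

    durations_ms[i] is the duration of the i-th phoneme and points[i] is the
    list of (x, y) pairs set for it, where x is a quarter index from 0 to 4.
    """
    result: List[Tuple[float, float]] = []
    start = 0.0
    for duration, phoneme_points in zip(durations_ms, points):
        for x, y in phoneme_points:
            if not 0 <= x <= 4:
                raise ValueError("x must be a quarter index between 0 and 4")
            result.append((start + duration * x / 4.0, y))
        start += duration
    return result
```

For instance, with two phonemes lasting 40 ms and 100 ms, a point (2, 0.3) in the second phoneme lies at 40 + 100 × 2/4 = 90 ms.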

【００５０】In this embodiment, the time and frequency (pitch frequency) of a point pitch are determined by the accent type, by the position within the accent phrase of the phoneme for which the pitch is set, and by the type of that phoneme or its phoneme environment. This is described in detail below with reference to FIGS. 4(a) and 4(b).

【００５１】FIG. 4(a) shows the rules for assigning point pitch positions to the first vowel counted from the beginning of an accent phrase that has a flat-type or undulating-type accent and starts with a consonant, that is, to the vowel immediately following the phrase-initial consonant.

【００５２】As FIG. 4(a) shows, when an accent phrase begins with a consonant, the time points at which point pitches are specified within the immediately following vowel depend on whether the preceding consonant is voiced or voiceless and on whether the phoneme immediately after the vowel is the moraic nasal. In other words, they depend on the types of the neighboring phonemes, that is, on the phoneme environment. There are four combinations of neighboring phoneme types (phoneme environments) for the vowel in question, and one point pitch position assignment rule is provided for each combination (rules No. (1) to (4)).

【００５３】FIG. 4(b) shows the rules for assigning point pitch positions to the vowel or moraic nasal that comes second (the second syllable) when only vowels and moraic nasals are counted from the beginning of an accent phrase with a flat-type or undulating-type accent.

【００５４】As FIG. 4(b) shows, the time points at which point pitches are specified within this second vowel or moraic nasal depend on whether it is the moraic nasal, that is, on the type of the phoneme itself (phoneme type). Two rules are provided: a point pitch position assignment rule for the case where the phoneme is the moraic nasal (rule No. (1)) and one for the case where it is not (rule No. (2)).

【００５５】FIGS. 5(a) to 5(d) show, for four example accent phrases, the point pitch positions determined by the assignment rules of FIGS. 4(a) and 4(b). As in FIG. 3, the horizontal axis represents time, the vertical axis represents pitch (in octaves, oct), and the white circles mark the point pitch positions.

【００５６】First, for an accent phrase whose initial phoneme is a voiced consonant and whose second syllable is not the moraic nasal, for example "yomimasu" (読みます, "read"), rule No. (4) of FIG. 4(a) and rule No. (2) of FIG. 4(b) are applied, producing the point pitch positions shown in FIG. 5(a).

【００５７】Next, for an accent phrase whose initial phoneme is a voiced consonant and whose second syllable is the moraic nasal, for example "roNzimasu" (論じます, "discuss"), rule No. (3) of FIG. 4(a) and rule No. (1) of FIG. 4(b) are applied, producing the point pitch positions shown in FIG. 5(b).

【００５８】Next, for an accent phrase whose initial phoneme is a voiceless consonant and whose second syllable is not the moraic nasal, for example "torimasu" (取ります, "take"), rule No. (2) of FIG. 4(a) and rule No. (2) of FIG. 4(b) are applied, producing the point pitch positions shown in FIG. 5(c).

【００５９】Next, for an accent phrase whose initial phoneme is a voiceless consonant and whose second syllable is the moraic nasal, for example "kaNzimasu" (感じます, "feel"), rule No. (1) of FIG. 4(a) and rule No. (1) of FIG. 4(b) are applied, producing the point pitch positions shown in FIG. 5(d).
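
The rule selection in the four cases above can be summarized in a short sketch. The rule numbers follow the text (FIG. 4(a): voiceless + moraic nasal gives (1), voiceless + other gives (2), voiced + moraic nasal gives (3), voiced + other gives (4); FIG. 4(b): moraic nasal gives (1), other gives (2)). The function names and the small set of voiceless consonants are hypothetical simplifications, and the sketch assumes a phrase that starts with a single-letter consonant followed by a vowel.

```python
from typing import Tuple

VOWELS = set("aiueo")
MORAIC_NASAL = "N"
# Simplified, hypothetical list of voiceless consonant symbols.
VOICELESS_CONSONANTS = {"k", "s", "t", "h", "p"}


def rule_fig4a(initial_voiced: bool, next_is_nasal: bool) -> int:
    """Rule number of FIG. 4(a) for the vowel after the phrase-initial consonant."""
    if not initial_voiced:
        return 1 if next_is_nasal else 2
    return 3 if next_is_nasal else 4


def rule_fig4b(is_nasal: bool) -> int:
    """Rule number of FIG. 4(b) for the second vowel or moraic nasal."""
    return 1 if is_nasal else 2


def select_rules(phrase: str) -> Tuple[int, int]:
    """Return (FIG. 4(a) rule, FIG. 4(b) rule) for a consonant-initial phrase."""
    if len(phrase) < 3 or phrase[0] in VOWELS or phrase[1] not in VOWELS:
        raise ValueError("expected a phrase starting with consonant + vowel")
    initial_voiced = phrase[0] not in VOICELESS_CONSONANTS
    next_is_nasal = phrase[2] == MORAIC_NASAL
    # Second syllable: second symbol when only vowels and "N" are counted.
    nuclei = [p for p in phrase if p in VOWELS or p == MORAIC_NASAL]
    second_is_nasal = nuclei[1] == MORAIC_NASAL
    return rule_fig4a(initial_voiced, next_is_nasal), rule_fig4b(second_is_nasal)
```

Applied to the four example phrases, `select_rules` returns (4, 2) for "yomimasu", (3, 1) for "roNzimasu", (2, 2) for "torimasu", and (1, 1) for "kaNzimasu", matching the rule pairs named in the text.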

【００６０】The pitch pattern obtained by linearly interpolating between the point pitch positions generated in this way (the thick lines in FIGS. 5(a) to 5(d)) closely approximates that of actual human speech.

【００６１】In the pitch patterns of flat-type or undulating-type accents in real speech, accent phrases whose initial consonant is voiced tend to start at a lower pitch than those whose initial consonant is voiceless. (Comparing FIG. 5(a) with FIG. 5(c), or FIG. 5(b) with FIG. 5(d), the pitch at the start of the consonant is 0 [oct] in every case. However, pitch is not used when a voiceless consonant is synthesized, so pitch is first applied from the start of the following vowel, at 0.3 [oct], and the phrase therefore begins at a higher pitch than an accent phrase starting with a voiced consonant.) It is also known that when the second syllable is the moraic nasal, the pitch rise tends to start and end earlier than otherwise.

【００６２】Conventionally, pitch was assigned without regard to phoneme type or phoneme environment, taking no account of these pitch characteristics of real speech. In the prior art shown in FIG. 16, only one point pitch can be set per syllable, at its center. The starting pitch can therefore just barely be adjusted by raising or lowering the point pitch of the first syllable, but the start and the end of the pitch rise can be placed only at the center of the first syllable and the center of the second syllable, respectively. For this reason it was difficult for the prior art to approach the pitch pattern of real human speech.

【００６３】As also pointed out in Japanese Patent Laid-Open No. 3-164800, precise pitch control requires setting point pitches at intervals of about one quarter of the duration of a syllable. The same reasoning applies to phonemes: for a vowel with no consonant part, the phoneme and the syllable coincide, so it is reasonable to place point pitches at the time points obtained by dividing each phoneme into, for example, four equal parts. Even with four equal divisions, however, assigning a pitch frequency at every time point is inefficient. One or two points suffice, and some phonemes need no point pitch at all.

【００６４】Of course, as the example of FIG. 3 shows, consonants have relatively short durations, so their divisions are finer and the interval between the time points at which a point pitch can be specified is narrower. Being able to specify point pitches finely does not harm pitch control, however, and dividing consonants into the same number of parts as vowels is more convenient for processing. Changing the number of divisions for consonants is conceivable, but it could confuse users when they later rewrite the generated speech symbol string.

【００６５】When the phoneme-based point pitch setting processing unit 22 has determined the point pitch positions for each phoneme, the speech symbol string generation processing unit 23 in the same speech synthesis unit 2 is activated. The speech symbol string generation processing unit 23 generates a speech symbol string that describes the phoneme information (phoneme sequence) obtained by the language analysis processing unit 11, the duration of each phoneme obtained by the phoneme duration calculation processing unit 21, and the point pitch positions obtained by the phoneme-based point pitch setting processing unit 22.

【００６６】FIG. 6 shows an example of the speech symbol string for the accent phrase "yomimasu" (読みます). As the figure shows, the speech symbol string consists of phoneme information (phoneme sequence) 610 and prosody information 620. The prosody information 620 consists of an intonation component 621 and information 622 on the phoneme duration and point pitch positions determined for each phoneme of the phoneme information 610 (phoneme duration and pitch information). The intonation component 621 represents the natural declination of pitch. The example "166, 0.2" in FIG. 6 indicates a declination component that starts at 166 Hz and falls by 0.2 [oct] toward the end of the phrase. The ";" in the speech symbol string of FIG. 6 marks the end of an accent phrase.

【００６７】The speech symbol string generation processing unit 23 can either pass the generated speech symbol string (via a memory, not shown) to the pitch pattern generation processing unit 24 in the speech synthesis unit 2 so that processing continues there, or first write it to the speech symbol string file 31 in the speech symbol string editing unit 3 so that it can be edited by that unit. The editing of speech symbol strings in the speech symbol string file 31 by the speech symbol string editing unit 3 is described in detail in the second embodiment below and is omitted here. Note that the second embodiment uses a speech symbol string editing unit 6 in place of the speech symbol string editing unit 3.

【００６８】The pitch pattern generation processing unit 24 in the speech synthesis unit 2 generates a pitch pattern (prosodic parameters) from the speech symbol string passed from the speech symbol string generation processing unit 23 or stored in the speech symbol string file 31. Specifically, based on the phoneme duration and point pitch positions set for each phoneme in the speech symbol string, the pitch pattern generation processing unit 24 linearly interpolates between the point pitches to generate the accent component of the pitch at intervals of, for example, 10 ms. It then adds the intonation component (natural declination of pitch) set in the speech symbol string to obtain the pitch pattern of the speech to be synthesized at 10 ms intervals.
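
A minimal sketch of this step is shown below: point pitches are linearly interpolated every 10 ms to give the accent component, and a declination line is added. Two details are assumptions rather than statements of the text: the accent component and the declination are summed in the octave domain relative to the starting frequency, and the declination falls linearly over the phrase. The function names are hypothetical.

```python
from typing import List, Tuple


def interpolate(points: List[Tuple[float, float]], t: float) -> float:
    """Linear interpolation over (time_ms, pitch_oct) points with strictly
    increasing times; values are held constant outside the covered range."""
    if t <= points[0][0]:
        return points[0][1]
    if t >= points[-1][0]:
        return points[-1][1]
    for (t0, y0), (t1, y1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return y0 + (y1 - y0) * (t - t0) / (t1 - t0)
    raise ValueError("points must be sorted by time")


def pitch_pattern(points: List[Tuple[float, float]], total_ms: int,
                  base_hz: float, fall_oct: float, step_ms: int = 10) -> List[float]:
    """Return the pitch in Hz at every step_ms from 0 to total_ms.

    base_hz and fall_oct play the role of an intonation component such as
    "166, 0.2": start at base_hz and fall by fall_oct octaves by the end.
    """
    frames: List[float] = []
    for t in range(0, total_ms + 1, step_ms):
        accent_oct = interpolate(points, t)
        declination_oct = -fall_oct * t / total_ms
        frames.append(base_hz * 2.0 ** (accent_oct + declination_oct))
    return frames
```

With an intonation component of 166 Hz and 0.2 oct and a flat accent component, the last frame under these assumptions is 166 × 2^(-0.2) Hz.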

【００６９】Meanwhile, the phoneme parameter generation processing unit 25 in the speech synthesis unit 2 generates phoneme parameters from the phoneme information (phoneme sequence) in the speech symbol string passed from the speech symbol string generation processing unit 23 or stored in the speech symbol string file 31. This processing runs, for example, in parallel with the pitch pattern generation by the pitch pattern generation processing unit 24, as follows.

【００７０】In this embodiment, a speech segment file (not shown) is prepared that stores a total of 137 speech segments. These segments were obtained by analyzing real speech with the improved cepstrum method, using a window length of 20 ms and a window period of 10 ms, and cutting the resulting 19th-order cepstrum into consonant + vowel (CV) units. At the start of text-to-speech conversion processing by the conversion software, the contents of this speech segment file are assumed to have been loaded into a speech segment area (hereinafter called the speech segment memory) 26 allocated in, for example, main memory (not shown).

【００７１】According to the phoneme information in the speech symbol string passed from the speech symbol string generation processing unit 23 or stored in the speech symbol string file 31, the phoneme parameter generation processing unit 25 sequentially reads the CV-unit speech segments described above from the speech segment memory 26 and concatenates them to generate the phoneme parameters (feature parameters).

【００７２】When the pitch pattern generation processing unit 24 has generated the pitch pattern and the phoneme parameter generation processing unit 25 has generated the phoneme parameters, the synthesis filter processing unit 27 in the speech synthesis unit 2 is activated. As shown in FIG. 7, the synthesis filter processing unit 27 consists of a white noise generation unit 271, an impulse generation unit 272, a driving sound source switching unit 273, and an LMA (Log Magnitude Approximation) filter 274. It synthesizes speech from the generated pitch pattern (prosodic pattern) and phoneme parameters as follows.

【００７３】First, in the voiced parts (U) of the speech, the driving sound source switching unit 273 switches to the impulse generation unit 272. The impulse generation unit 272 generates impulses at intervals corresponding to the pitch pattern generated by the pitch pattern generation processing unit 24, and these impulses drive the LMA filter 274 as the sound source. In the unvoiced parts (V) of the speech, the driving sound source switching unit 273 switches to the white noise generation unit 271. The white noise generation unit 271 generates white noise, which drives the LMA filter 274 as the sound source.
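
The source switching can be sketched as follows: for each 10 ms frame, a voiced frame produces an impulse train whose spacing follows the frame's pitch, and an unvoiced frame produces white noise. The 10 kHz sampling rate, the unit impulse amplitude, the Gaussian noise, and the function name are assumptions made for illustration. The LMA filter itself is not modeled here.

```python
import random
from typing import List, Tuple


def excitation(frames: List[Tuple[bool, float]], fs: int = 10000,
               frame_ms: int = 10, seed: int = 0) -> List[float]:
    """Build the driving signal from (voiced flag, pitch in Hz) frames."""
    rng = random.Random(seed)
    samples_per_frame = fs * frame_ms // 1000
    out: List[float] = []
    next_pulse = 0.0  # sample index at which the next impulse is due
    for index, (voiced, f0_hz) in enumerate(frames):
        start = index * samples_per_frame
        for k in range(start, start + samples_per_frame):
            if voiced:
                # Impulse source: one pulse every fs / f0 samples.
                if k >= next_pulse:
                    out.append(1.0)
                    next_pulse = k + fs / f0_hz
                else:
                    out.append(0.0)
            else:
                # Noise source; restart the pulse timer at the next voiced sample.
                out.append(rng.gauss(0.0, 1.0))
                next_pulse = k + 1
    return out
```

With two voiced frames at 100 Hz, this sketch places impulses at samples 0 and 100, that is, one pitch period (10 ms) apart.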

【００７４】The LMA filter 274 uses the cepstrum of the speech (cepstral parameters) directly as its filter coefficients. Because the phoneme parameters generated by the phoneme parameter generation processing unit 25 in this embodiment are cepstra, as described above, they serve as the filter coefficients of the LMA filter 274, which outputs synthesized speech when driven by the sound source selected by the driving sound source switching unit 273.

【００７５】The speech synthesized by the synthesis filter processing unit 27 (by the LMA filter 274 within it) is converted to an analog signal by a D/A (digital/analog) converter (not shown) and output through an amplifier to a speaker or the like, where it can be heard as sound.

[Second Embodiment] FIG. 8 is a block diagram showing the schematic configuration of a speech synthesizer according to the second embodiment of the present invention.

【００７６】The speech synthesizer of FIG. 8 differs from that of FIG. 1 chiefly in that a syllable-based point pitch setting processing unit 52, which sets point pitches for each syllable, is provided in place of the phoneme-based point pitch setting processing unit 22.

【００７７】This speech synthesizer provides a text-to-speech conversion function, that is, a function that generates speech from text, implemented by running dedicated software (text-to-speech conversion software). Its functional configuration is broadly divided into a language processing unit 4, a speech synthesis unit 5, and a speech symbol string editing unit 6.

【００７８】The language processing unit 4 analyzes an input sentence, for example a sentence of mixed kanji and kana, and generates reading information and accent information. Based on the sentence analysis result from the language processing unit 4, the speech synthesis unit 5 generates a speech symbol string that describes the phoneme information, the durations of each syllable (of its consonant part and vowel part), and the point pitch positions for each syllable, and then generates speech from this speech symbol string. The speech symbol string editing unit 6 edits, in response to user operations, a speech symbol string generated by the speech synthesis unit 5 or created by the user.

【００７９】In the speech synthesizer of FIG. 8, the document to be converted to speech (read aloud), here a Japanese document, is stored as a text file (not shown). Following the text-to-speech conversion software, the apparatus reads mixed kanji and kana sentences from this file one sentence at a time, and the language processing unit 4 and the speech synthesis unit 5 perform the text-to-speech conversion processing described below to synthesize speech.

【００８０】First, a mixed kanji and kana sentence read from the text file is input to the language analysis processing unit 41 in the language processing unit 4. The language analysis processing unit 41 performs morphological analysis on the input sentence and generates reading information and accent information.

【００８１】To do so, the language analysis processing unit 41 uses a morpheme dictionary 42, whose entries are morphemes (the smallest constituent elements of a sentence), and a connection rule file 43, in which the rules for connecting morphemes are registered. Specifically, the language analysis processing unit 41 finds every candidate morpheme sequence obtained by matching the input sentence against the morpheme dictionary 42 (exhaustive search) and, referring to the connection rule file 43, outputs from among them the combinations whose adjacent morphemes can be connected grammatically. The morpheme dictionary 42 registers the reading and accent type of each morpheme together with the grammatical information used during analysis. Therefore, once the morphemes are determined by morphological analysis, their readings and accent types are obtained at the same time.

【００８２】For example, as in the first embodiment, morphological analysis of the sentence 「公園へ行って本を読みます。」 ("I go to the park and read a book.") divides it into the morphemes /公園/へ/行って/本/を/読み/ます/。 At the same time, each morpheme is given a reading and an accent type, yielding /コウエン/エ/イッテ/ホ^ン/ヲ/ヨミ/マ^ス/. Here a morpheme containing "^" has an accent in which the pitch is high on the syllable immediately before the "^" (note that this is a syllable, not a phoneme as in the first embodiment) and falls on the syllable immediately after it. A morpheme without "^" has a flat-type accent.

【００８３】When people read a sentence aloud, however, they do not accent each morpheme individually in this way. They group several morphemes together and place an accent on each group.

【００８４】Taking this into account, the language analysis processing unit 41 further groups the morphemes into accent phrases (the units that each receive one accent) and, at the same time, estimates the accent shift caused by the grouping. In addition, the language analysis processing unit 41 adds information such as vowel devoicing and pauses (breaths) for reading aloud. For the above example, it finally generates the following reading information.

【００８５】／コウエンエ／イッテ．／ホ＾ンオ／ヨミマ＾（ス）／ここで、ピリオド「．」は息継ぎを、「（）」は母音
が無声化した音節を表す。/ kouen-e / itte. / ho^n-o / yomima^(su) / Here, the period "." represents a pause for breath, and "( )" encloses a syllable whose vowel is devoiced.
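
As an illustration only, the reading-information notation described here can be split into accent phrases as sketched below. The sketch works on an ASCII romanization of the example; the function name `parse_reading` and the dictionary layout are assumptions made for this sketch, since the text defines only the notation ("/" between accent phrases, "^" for the accent, "." for a breath pause, "( )" for a devoiced syllable) and not a data format.

```python
def parse_reading(reading: str) -> list[dict]:
    """Split reading information into accent phrases (hypothetical helper)."""
    phrases = []
    for chunk in reading.split("/"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty pieces before the first and after the last "/"
        pause = chunk.endswith(".")  # "." marks a breath pause after the phrase
        if pause:
            chunk = chunk[:-1]
        phrases.append({
            "text": chunk.replace("^", ""),  # reading without the accent mark
            "flat": "^" not in chunk,        # no "^" means a flat-type accent
            "pause_after": pause,
            "devoiced": "(" in chunk,        # "( )" encloses a devoiced syllable
        })
    return phrases


if __name__ == "__main__":
    for phrase in parse_reading("/kouen-e/itte./ho^n-o/yomima^(su)/"):
        print(phrase)
```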

【００８６】さて、上記のようにして言語処理部４内の
言語解析処理部４１により読み情報が生成されると、音
声合成部５内の音韻継続時間計算処理部５１が起動され
る。音韻継続時間計算処理部５１は、言語解析処理部４
１で生成した読み情報に従って、入力文に含まれる各音
節の子音部並びに母音部の継続時間（単位はｍｓ）を決
定する。この音韻継続時間計算処理部５１での継続時間
の決定処理は、前記第１の実施形態と同様に、子音
（Ｃ）と母音（Ｖ）の境界（ＣＶわたり）の位置が等間
隔に並ぶようにする（図２参照）という、極めて簡単な
アルゴリズムにより実現されている。When the reading information has been generated as described above by the language analysis processing unit 41 in the language processing unit 4, the phoneme duration calculation processing unit 51 in the speech synthesis unit 5 is activated. According to the reading information generated by the language analysis processing unit 41, the phoneme duration calculation processing unit 51 determines the durations (in ms) of the consonant part and the vowel part of each syllable in the input sentence. As in the first embodiment, this duration determination is realized by an extremely simple algorithm that places the boundaries between consonants (C) and vowels (V) (the CV transitions) at equal intervals (see FIG. 2).
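
The equal-interval placement of CV boundaries can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of FIG. 2: the text says only that the CV boundaries are placed at equal intervals, so the per-syllable consonant durations taken as input, the interval value `cv_interval_ms`, the handling of the final vowel, and both function names are hypothetical.

```python
def assign_durations(consonant_ms, cv_interval_ms=150.0, final_vowel_ms=None):
    """Return (consonant_ms, vowel_ms) per syllable so that CV boundaries are equally spaced.

    Assumption: each vowel fills the gap between its own CV boundary and the
    start of the next syllable's consonant.
    """
    if final_vowel_ms is None:
        final_vowel_ms = cv_interval_ms  # assumed length of the last vowel
    durations = []
    for i, consonant in enumerate(consonant_ms):
        if i + 1 < len(consonant_ms):
            vowel = cv_interval_ms - consonant_ms[i + 1]
        else:
            vowel = final_vowel_ms
        if vowel <= 0:
            raise ValueError("consonant longer than the CV interval")
        durations.append((consonant, vowel))
    return durations


def cv_boundaries(durations):
    """Times (ms) of the consonant-to-vowel boundary of each syllable."""
    t = 0.0
    boundaries = []
    for consonant, vowel in durations:
        boundaries.append(t + consonant)
        t += consonant + vowel
    return boundaries
```

With consonant durations of 60, 40, 50 and 70 ms and a 150 ms interval (illustrative values), the resulting CV boundaries fall exactly 150 ms apart.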

【００８７】音韻継続時間計算処理部５１により入力文
に含まれる各音節の（子音部並びに母音部の）継続時間
が決定されると、同じ音声合成部５内の音節毎点ピッチ
設定処理部５２が起動される。音節毎点ピッチ設定処理
部５２は、音韻継続時間計算処理部５１により決定され
た継続時間と、言語解析処理部４１により決定されたア
クセント情報に基づいて、点ピッチ位置を設定する。こ
こでは、点ピッチ設定の時間として、各音節の継続時間
を等間隔で４分割（即ち４等分）したうちの１点または
２点が決定される。この点ピッチ設定の具体例を、図２
のように各音節の（子音部並びに母音部の）継続時間が
決定されたアクセント句「ヨミマ（ス）」の場合につい
て、図９に示す。なお、図９では、横軸が時間を、縦軸
がピッチ（単位はオクターブｏｃｔ）を表し、白丸の位
置が点ピッチ位置である。ここで点ピッチ位置（ｘ，
ｙ）は、音節の継続時間を４等分した時点０〜３（但
し、句末の音節の最終時点は４）を示すｘと、ピッチ
（単位ｏｃｔ）を示すｙとにより表される。When the phoneme duration calculation processing unit 51 has determined the durations (of the consonant part and the vowel part) of each syllable in the input sentence, the per-syllable point pitch setting processing unit 52 in the same speech synthesis unit 5 is activated. The per-syllable point pitch setting processing unit 52 sets the point pitch positions based on the durations determined by the phoneme duration calculation processing unit 51 and the accent information determined by the language analysis processing unit 41. Here, the time at which a point pitch is set is chosen as one or two of the points obtained by dividing the duration of each syllable into four equal parts. FIG. 9 shows a concrete example of this point pitch setting for the accent phrase "ヨミマ（ス）" (yomima(su)), whose syllable durations (consonant and vowel parts) have been determined as in FIG. 2. In FIG. 9, the horizontal axis represents time, the vertical axis represents pitch (in octaves, oct), and the white circles mark the point pitch positions. A point pitch position (x, y) is expressed by x, which indicates one of the time points 0 to 3 obtained by dividing the syllable duration into four equal parts (the final time point of the phrase-final syllable being 4), and y, which indicates the pitch (in oct).
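
The mapping from the index x of a point pitch position (x, y) to an absolute time can be sketched as below. The function name and the argument layout are assumptions for illustration; only the division of the syllable duration into four equal parts, and the use of index 4 for the end of a phrase-final syllable, come from the description.

```python
def point_pitch_time_ms(syllable_start_ms, syllable_duration_ms, x):
    """Absolute time of quarter-point x (0..4) inside a syllable (hypothetical helper).

    x = 0..3 are the quarter points of the syllable; x = 4 is the syllable end,
    which the description uses only for the last syllable of an accent phrase.
    """
    if x not in (0, 1, 2, 3, 4):
        raise ValueError("x must be an integer from 0 to 4")
    return syllable_start_ms + syllable_duration_ms * x / 4.0
```

For a syllable that starts at 100 ms and lasts 160 ms, x = 1 corresponds to 140 ms and x = 4 to 260 ms.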

【００８８】本実施形態において、点ピッチ設定の時間
と周波数（ピッチ周波数）は、アクセント型と、ピッチ
の設定される音節がアクセント句の何番目に位置するか
と、更にその音節の種類或いは音節環境によって決ま
る。これについて、図１０（ａ），（ｂ）を参照して以
下に詳述する。In this embodiment, the time and frequency (pitch frequency) of each point pitch are determined by the accent type, by the position within the accent phrase of the syllable for which the pitch is set, and by the type of that syllable or its syllable environment. This is described in detail below with reference to FIGS. 10(a) and 10(b).

【００８９】まず図１０（ａ）は、子音を含む音節で始
まり、平板型または起伏型アクセントを持つアクセント
句の先頭音節（第１音節）に設定されるピッチ位置の付
与ルールを示す。First, FIG. 10(a) shows the rules for assigning the pitch positions set in the head syllable (first syllable) of an accent phrase that begins with a syllable containing a consonant and has a flat-type or undulating-type accent.

【００９０】この図１０（ａ）から明らかなように、平
板型または起伏型アクセント句の先頭音節内の点ピッチ
の指定の時点は、その先頭音節に含まれる子音が有声子
音であるか無声子音であるかと、その先頭音節の直後の
音節（第２音節）が撥音であるか否か、即ち先頭音節の
種類（点ピッチ設定の対象音節自身の種類）と直後の音
節の種類、言い換えれば音節環境によって決まる。この
点ピッチ設定の対象となる先頭音節の種類と直後の音節
の種類の組み合わせ（音節環境）は４種類あり、その組
み合わせ数分の点ピッチ位置付与ルール（ルールＮｏ．
（１）〜（４））が用意されている。As is apparent from FIG. 10(a), the time points at which point pitches are specified within the head syllable of a flat-type or undulating-type accent phrase are determined by whether the consonant in the head syllable is voiced or unvoiced and by whether the syllable immediately after the head syllable (the second syllable) is a moraic nasal (撥音); that is, by the type of the head syllable (the syllable for which the point pitch is set) and the type of the syllable immediately after it, in other words by the syllable environment. There are four combinations (syllable environments) of head-syllable type and following-syllable type, and one point pitch position assignment rule is provided for each combination (rules No. (1) to (4)).

【００９１】次に図１０（ｂ）は、平板型または起伏型
アクセントを持つアクセント句の第２音節に設定される
ピッチ位置の付与ルールを示す。この図１０（ｂ）から
明らかなように、平板型または起伏型アクセント句の第
２音節内の点ピッチの指定の時点は、その音節が撥音で
あるか否か、即ちその音節自身の種類（音節種）により
決まる。ここでは、当該音節種が撥音である場合の点ピ
ッチ位置付与ルール（ルールＮｏ．（１））と当該音節
種が撥音以外である場合の点ピッチ位置付与ルール（ル
ールＮｏ．（２））とが用意されている。Next, FIG. 10(b) shows the rules for assigning the pitch positions set in the second syllable of an accent phrase having a flat-type or undulating-type accent. As is apparent from FIG. 10(b), the time points at which point pitches are specified within the second syllable of a flat-type or undulating-type accent phrase are determined by whether that syllable is a moraic nasal (撥音), that is, by the type of the syllable itself (the syllable type). Here, one point pitch position assignment rule is provided for the case where the syllable is a moraic nasal (rule No. (1)) and another for the case where it is not (rule No. (2)).

【００９２】以上の図１０（ａ），（ｂ）に示した点ピ
ッチ位置付与ルールに基づいて決定される点ピッチ位置
を、４つのアクセント句の例について図１１（ａ）〜
（ｄ）に示す。この図１１（ａ）〜（ｄ）では、図９と
同様に、横軸が時間を、縦軸がピッチ（単位はオクター
ブｏｃｔ）を表し、白丸の位置が点ピッチ位置である。FIGS. 11(a) to 11(d) show, for four example accent phrases, the point pitch positions determined according to the point pitch position assignment rules of FIGS. 10(a) and 10(b). In FIGS. 11(a) to 11(d), as in FIG. 9, the horizontal axis represents time, the vertical axis represents pitch (in octaves, oct), and the white circles mark the point pitch positions.

【００９３】まず、アクセント句先頭音節内の子音が有
声子音で、第２音節が撥音でないアクセント句の場合、
例えば「ヨミマス（読みます）」というアクセント句の
場合であれば、図１０（ａ）中のルールＮｏ．（４）と
図１０（ｂ）中のルールＮｏ．（２）とが適用され、図
１１（ａ）に示すような点ピッチ位置が生成（設定）さ
れる。First, for an accent phrase in which the consonant in the head syllable is voiced and the second syllable is not a moraic nasal, for example "ヨミマス" (yomimasu, "read"), rule No. (4) in FIG. 10(a) and rule No. (2) in FIG. 10(b) are applied, and the point pitch positions shown in FIG. 11(a) are generated (set).

【００９４】次に、アクセント句先頭音節内の子音が有
声子音で、第２音節が撥音であるアクセント句の場合、
例えば「ロンジマス（論じます）」というアクセント句
の場合であれば、図１０（ａ）中のルールＮｏ．（３）
と図１０（ｂ）中のルールＮｏ．（１）とが適用され、
図１１（ｂ）に示すような点ピッチ位置が生成される。Next, for an accent phrase in which the consonant in the head syllable is voiced and the second syllable is a moraic nasal, for example "ロンジマス" (ronjimasu, "discuss"), rule No. (3) in FIG. 10(a) and rule No. (1) in FIG. 10(b) are applied, and the point pitch positions shown in FIG. 11(b) are generated.

【００９５】次に、アクセント句先頭音節内の子音が無
声子音で、第２音節が撥音でないアクセント句の場合、
例えば「トリマス（取ります）」というアクセント句の
場合であれば、図１０（ａ）中のルールＮｏ．（２）と
図１０（ｂ）中のルールＮｏ．（２）とが適用され、図
１１（ｃ）に示すような点ピッチ位置が生成される。Next, for an accent phrase in which the consonant in the head syllable is unvoiced and the second syllable is not a moraic nasal, for example "トリマス" (torimasu, "take"), rule No. (2) in FIG. 10(a) and rule No. (2) in FIG. 10(b) are applied, and the point pitch positions shown in FIG. 11(c) are generated.

【００９６】次に、アクセント句先頭音節内の子音が無
声子音で、第２音節が撥音であるアクセント句の場合、
例えば「カンジマス（感じます）」というアクセント句
の場合であれば、図１０（ａ）中のルールＮｏ．（１）
と図１０（ｂ）中のルールＮｏ．（１）とが適用され、
図１１（ｄ）に示すような点ピッチ位置が生成される。Next, for an accent phrase in which the consonant in the head syllable is unvoiced and the second syllable is a moraic nasal, for example "カンジマス" (kanjimasu, "feel"), rule No. (1) in FIG. 10(a) and rule No. (1) in FIG. 10(b) are applied, and the point pitch positions shown in FIG. 11(d) are generated.
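
The rule selection described in the four cases above can be summarized as a small lookup. The function names are hypothetical; the rule numbers are those stated in the description. The point pitch positions that each rule assigns are given only in FIG. 10 and are deliberately not reproduced here.

```python
def first_syllable_rule(head_consonant_voiced: bool, second_is_moraic_nasal: bool) -> int:
    """Rule number of FIG. 10(a) for the head syllable of an accent phrase."""
    table = {
        (False, True): 1,   # unvoiced head consonant, second syllable is a moraic nasal
        (False, False): 2,  # unvoiced head consonant, second syllable is not
        (True, True): 3,    # voiced head consonant, second syllable is a moraic nasal
        (True, False): 4,   # voiced head consonant, second syllable is not
    }
    return table[(head_consonant_voiced, second_is_moraic_nasal)]


def second_syllable_rule(is_moraic_nasal: bool) -> int:
    """Rule number of FIG. 10(b) for the second syllable."""
    return 1 if is_moraic_nasal else 2
```

For example, "yomimasu" (voiced head consonant, second syllable not a moraic nasal) selects rule (4) of FIG. 10(a) and rule (2) of FIG. 10(b).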

【００９７】このようにして生成された複数の点ピッチ
位置を直線で補間して得られるピッチパターン（図１１
（ａ）〜（ｄ）中の太線部分）は、前記第１の実施形態
で得られるピッチパターン同様に、実際の人間の発声す
る音声のそれをよく近似している。The pitch patterns obtained by linearly interpolating between the point pitch positions generated in this way (the thick lines in FIGS. 11(a) to 11(d)) closely approximate those of actual human speech, as do the pitch patterns obtained in the first embodiment.

【００９８】実際の音声の平板型または起伏型アクセン
トのピッチパターンでは、アクセント区先頭子音が無声
子音のときよりも有声子音のときの方が、低いピッチか
ら始まる傾向がある。また、第２音節が撥音の場合に
は、ピッチの上昇の開始時点と終了時点がそうでない場
合より早くなる傾向があることが知られている。In the pitch pattern of a flat-type or undulating-type accent in actual speech, the pitch tends to start lower when the head consonant of the accent phrase is voiced than when it is unvoiced. It is also known that when the second syllable is a moraic nasal, the pitch rise tends to start and end earlier than when it is not.

【００９９】前記第１の実施形態において既に説明した
ことの繰り返しになるが、従来は、このような実際の音
声のピッチ特徴を全く考慮せず、音節種や音節環境に関
係なくピッチを与えていた。音節種や音節環境を考慮し
てピッチパターンを変えることは、図１６で示したよう
な従来技術でも可能ではある。しかし、従来技術では、
音節に対してその中心に一つだけの点ピッチしか設定で
きないため、開始ピッチの設定については第１音節の点
ピッチの上下によってかろうじて変えることが可能であ
るものの、ピッチ上昇の開始時点及びピッチ上昇の終了
時点は、それぞれ第１音節の中心時点と第２音節の中心
時点にしか設定できない。このため従来技術では、実際
の人間の音声のピッチパターンに近づけるのは困難であ
った。To repeat what was explained for the first embodiment, pitch has conventionally been assigned regardless of syllable type or syllable environment, with no consideration of these pitch characteristics of actual speech. Changing the pitch pattern according to syllable type or syllable environment is possible even with the conventional technique shown in FIG. 16. In that technique, however, only one point pitch can be set per syllable, at its center. The starting pitch can just barely be varied by raising or lowering the point pitch of the first syllable, but the start and end of the pitch rise can be placed only at the center of the first syllable and the center of the second syllable, respectively. It was therefore difficult for the conventional technique to approach the pitch pattern of actual human speech.

【０１００】ところで、精密なピッチ制御を行なうに
は、特開平３−１６４８００号公報でも指摘しているよ
うに、１音節の継続時間を４等分した程度の間隔で点ピ
ッチを設定する必要がある。但し、４等分した場合で
も、全ての時点にピッチ周波数を与えるのは効率が悪
く、１乃至２点で十分であり、全く点ピッチを与えなく
ともよい音節もある。Incidentally, as also pointed out in Japanese Unexamined Patent Publication No. H3-164800, precise pitch control requires point pitches to be set at intervals of roughly one quarter of the duration of a syllable. Even with this division into four, however, assigning a pitch frequency at every time point is inefficient; one or two points are sufficient, and some syllables need no point pitch at all.

【０１０１】さて、音節毎点ピッチ設定処理部５２によ
り音節毎の点ピッチ位置が決定されると、同じ音声合成
部５内の音声記号列生成処理部５３が起動される。音声
記号列生成処理部５３は、言語解析処理部４１により得
られた音韻情報、音韻継続時間計算処理部５１により得
られた各音節の（子音部及び母音部の）継続時間、及び
音節毎点ピッチ設定処理部５２により得られた点ピッチ
位置から、これらを記述した音声記号列を生成する。When the per-syllable point pitch setting processing unit 52 has determined the point pitch positions for each syllable, the phonetic symbol string generation processing unit 53 in the same speech synthesis unit 5 is activated. From the phoneme information obtained by the language analysis processing unit 41, the duration of each syllable (consonant part and vowel part) obtained by the phoneme duration calculation processing unit 51, and the point pitch positions obtained by the per-syllable point pitch setting processing unit 52, the phonetic symbol string generation processing unit 53 generates a phonetic symbol string describing them.

【０１０２】図１２は、アクセント句「ヨミマス（読み
ます）」についての音声記号列の例を示す。図から明ら
かなように、音声記号列は、音韻情報１２１０と韻律情
報１２２０とからなる。この韻律情報１２２０は、イン
トネーション成分１２２１と、音韻情報１２１０の各音
節毎に決定された音節継続時間（母音部継続時間、また
は子音部継続時間並びに母音部継続時間）と点ピッチ位
置の情報（音節継続時間・ピッチ情報）１２２２からな
る。イントネーション成分１２２１は、前記第１の実施
形態におけるイントネーション成分６２１と同様にピッ
チの自然降下成分を示す。また、図１２中の音声記号列
中の「；」は、アクセント句の終端を示す。なお、音韻
情報１２１０中の「（ス）」は母音が無声化した「ス」
を表す。FIG. 12 shows an example of the phonetic symbol string for the accent phrase "ヨミマス" (yomimasu, "read"). As is clear from the figure, the phonetic symbol string consists of phoneme information 1210 and prosody information 1220. The prosody information 1220 consists of an intonation component 1221 and information 1222 on the syllable duration (the vowel duration, or the consonant duration and the vowel duration) and the point pitch positions determined for each syllable of the phoneme information 1210 (syllable duration and pitch information). Like the intonation component 621 in the first embodiment, the intonation component 1221 represents the natural declination of the pitch. The ";" in the phonetic symbol string of FIG. 12 marks the end of the accent phrase. The "（ス）" in the phoneme information 1210 represents "ス" (su) with a devoiced vowel.

【０１０３】音声記号列生成処理部５３は、生成した音
声記号列を同じ音声合成部５内のピッチパターン生成処
理部５４に（図示せぬメモリを通して）渡して、このピ
ッチパターン生成処理部５４により処理を進めさせるこ
とも、当該音声記号列を音声記号列編集部６にて編集す
るために一旦音声記号列編集部６内の音声記号列ファイ
ル６１に書き込むことも可能である。この音声記号列編
集部６による音声記号列ファイル６１内の音声記号列の
編集処理については後述する。The phonetic symbol string generation processing unit 53 can either pass the generated phonetic symbol string (through a memory, not shown) to the pitch pattern generation processing unit 54 in the same speech synthesis unit 5 so that processing continues there, or first write it to the phonetic symbol string file 61 in the phonetic symbol string editing unit 6 so that it can be edited by the phonetic symbol string editing unit 6. The editing of phonetic symbol strings in the phonetic symbol string file 61 by the phonetic symbol string editing unit 6 is described later.

【０１０４】さて、音声合成部５内のピッチパターン生
成処理部５４は、音声記号列生成処理部５３から渡され
る、或いは音声記号列ファイル６１に格納されている音
声記号列をもとにピッチパターン（韻律パラメータ）を
生成する処理を行なう。即ちピッチパターン生成処理部
５４は、対象となる音声記号列中に設定されている各音
節毎の継続時間（母音部継続時間、または子音部継続時
間並びに母音部継続時間）と点ピッチ位置をもとに、各
点ピッチ間を直線補間してピッチのアクセント成分を例
えば１０ｍｓ毎に生成する。ピッチパターン生成処理部
５４は更に、対象となる音声記号列中に設定されている
イントネーション成分（ピッチの自然降下成分）を加え
て、合成すべき音声の１０ｍｓ毎のピッチパターンを得
る。The pitch pattern generation processing unit 54 in the speech synthesis unit 5 generates a pitch pattern (prosodic parameters) from the phonetic symbol string passed from the phonetic symbol string generation processing unit 53 or stored in the phonetic symbol string file 61. Specifically, based on the duration of each syllable (the vowel duration, or the consonant duration and the vowel duration) and the point pitch positions set in the target phonetic symbol string, the pitch pattern generation processing unit 54 linearly interpolates between the point pitches to generate the accent component of the pitch, for example every 10 ms. It then adds the intonation component (the natural declination of the pitch) set in the target phonetic symbol string to obtain the pitch pattern of the speech to be synthesized at 10 ms intervals.
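
The interpolation step can be sketched as follows. This is a minimal sketch under stated assumptions: point pitches are given as (time in ms, pitch in oct) pairs with strictly increasing times, and the intonation component is modeled as a linear declination with slope `declination_oct_per_s`. The description states only that a natural-declination component is added; its functional form, the function name and the parameter names are assumptions made for illustration.

```python
def pitch_pattern(points, frame_ms=10.0, declination_oct_per_s=0.0):
    """Linearly interpolate point pitches and add an assumed linear intonation component.

    points: list of (time_ms, pitch_oct) with strictly increasing times.
    Returns a list of (time_ms, pitch_oct), one entry per frame.
    """
    pts = sorted(points)
    if len(pts) < 2:
        raise ValueError("at least two point pitches are needed")
    if any(pts[i + 1][0] <= pts[i][0] for i in range(len(pts) - 1)):
        raise ValueError("point pitch times must be strictly increasing")

    start = pts[0][0]
    n_frames = int((pts[-1][0] - start) // frame_ms) + 1
    frames = []
    seg = 0  # index of the segment [pts[seg], pts[seg + 1]] containing t
    for k in range(n_frames):
        t = start + k * frame_ms
        while seg < len(pts) - 2 and t > pts[seg + 1][0]:
            seg += 1
        (t0, p0), (t1, p1) = pts[seg], pts[seg + 1]
        accent = p0 + (p1 - p0) * (t - t0) / (t1 - t0)          # accent component
        intonation = declination_oct_per_s * (t - start) / 1000.0  # assumed linear form
        frames.append((t, accent + intonation))
    return frames
```

Keeping the accent component and the intonation component as separate terms mirrors the description, where the two are generated separately and then summed.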

【０１０５】一方、音声合成部５内の音韻パラメータ生
成処理部５５は、音声記号列生成処理部５３から渡され
る、或いは音声記号列ファイル６１に格納されている音
声記号列中の音韻情報をもとに音韻パラメータを生成す
る処理を、例えばピッチパターン生成処理部５４による
ピッチパターン生成処理と並行して行なう。この音韻パ
ラメータ生成処理部５５による音韻パラメータ生成処理
については、前記第１の実施形態における音韻パラメー
タ生成処理部２５と同様に行なわれるため、説明を省略
する。Meanwhile, the phoneme parameter generation processing unit 55 in the speech synthesis unit 5 generates phoneme parameters from the phoneme information in the phonetic symbol string passed from the phonetic symbol string generation processing unit 53 or stored in the phonetic symbol string file 61, for example in parallel with the pitch pattern generation by the pitch pattern generation processing unit 54. This phoneme parameter generation is performed in the same way as by the phoneme parameter generation processing unit 25 of the first embodiment, so its description is omitted.

【０１０６】ピッチパターン生成処理部５４によりピッ
チパターンが生成され、音韻パラメータ生成処理部５５
により音韻パラメータが生成されると、（図７に示した
ような合成フィルタ処理部２７と同様の構成の）合成フ
ィルタ処理部５７が起動され、前記第１の実施形態にお
ける合成フィルタ処理部２７と同様にして、この生成さ
れたピッチパターンと音韻パラメータから音声を合成す
る。When the pitch pattern has been generated by the pitch pattern generation processing unit 54 and the phoneme parameters by the phoneme parameter generation processing unit 55, the synthesis filter processing unit 57 (which has the same configuration as the synthesis filter processing unit 27 shown in FIG. 7) is activated and, in the same way as the synthesis filter processing unit 27 of the first embodiment, synthesizes speech from the generated pitch pattern and phoneme parameters.

【０１０７】次に、音声記号列編集部６による音声記号
列ファイル６１内の音声記号列の編集処理について、図
１３のフローチャートを参照して説明する。まず音声記
号列編集部６は、音声記号列ファイル６１の他に、音声
記号列編集ツール（音声記号列編集ソフトウェア）に従
う音声記号列の修正処理を司る音声記号列修正処理部６
２と、ＣＲＴディスプレイ、液晶ディスプレイ等のディ
スプレイ装置６３と、キーボード、マウス等の入力装置
６４とを有している。このディスプレイ装置６３及び入
力装置６４には、図１３の音声合成装置を実現するパー
ソナルコンピュータ等の情報処理装置の持つディスプレ
イ装置及び入力装置を充てることができる。音声記号列
ファイル６１には、言語処理部４内の音声記号列生成処
理部５３によって前記した如く生成された音声記号列、
或いは利用者がパーソナルコンピュータ等を用いて作成
した音声記号列が保存される。Next, the editing of phonetic symbol strings in the phonetic symbol string file 61 by the phonetic symbol string editing unit 6 is described with reference to the flowchart of FIG. 13. In addition to the phonetic symbol string file 61, the phonetic symbol string editing unit 6 has a phonetic symbol string correction processing unit 62, which corrects phonetic symbol strings according to a phonetic symbol string editing tool (phonetic symbol string editing software); a display device 63 such as a CRT display or a liquid crystal display; and an input device 64 such as a keyboard or a mouse. The display device and input device of an information processing device, such as a personal computer, that implements the speech synthesizer of FIG. 13 can serve as the display device 63 and the input device 64. The phonetic symbol string file 61 stores phonetic symbol strings generated as described above by the phonetic symbol string generation processing unit 53 in the language processing unit 4, or phonetic symbol strings created by the user with a personal computer or the like.

【０１０８】音声記号列編集部６内の音声記号列修正処
理部６２は、利用者操作に従って入力装置６４を通して
起動されると、音声記号列生成処理部５３によって生成
された、或いは利用者が作成した、図１２に示したよう
な構造の音声記号列を音声記号列ファイル６１から読み
出す（ステップＳ１）。When activated through the input device 64 by user operation, the phonetic symbol string correction processing unit 62 in the phonetic symbol string editing unit 6 reads from the phonetic symbol string file 61 a phonetic symbol string, structured as shown in FIG. 12, that was generated by the phonetic symbol string generation processing unit 53 or created by the user (step S1).

【０１０９】次に音声記号列修正処理部６２は、読み出
した音声記号列中の音節継続時間及び点ピッチ位置の指
定に従い、音声合成部５内のピッチパターン生成処理部
５４と同様にして、例えば横軸を時間軸としたピッチパ
ターンを生成し（ステップＳ２）、そのピッチパターン
を、読み出した音声記号列中の音韻情報の示す音韻記号
と、その音韻の時間軸上の範囲と共に、ディスプレイ装
置６３の表示画面にグラフにより表示する（ステップＳ
３）。このピッチパターン表示例を、図１２の音声記号
列の場合について、図１４において符号１４１で示す。
なお、ピッチパターン（ピッチパターン１４１）は、音
声記号列中に設定されている点ピッチ位置間を直線で補
間することにより生成される（描かれる）。Next, following the syllable durations and point pitch positions specified in the phonetic symbol string it has read, the phonetic symbol string correction processing unit 62 generates a pitch pattern, for example with time on the horizontal axis, in the same way as the pitch pattern generation processing unit 54 in the speech synthesis unit 5 (step S2). It then displays the pitch pattern as a graph on the screen of the display device 63, together with the phoneme symbols indicated by the phoneme information in the string and the range of each phoneme on the time axis (step S3). An example of this display, for the phonetic symbol string of FIG. 12, is indicated by reference numeral 141 in FIG. 14. The pitch pattern (pitch pattern 141) is generated (drawn) by linearly interpolating between the point pitch positions set in the phonetic symbol string.

【０１１０】さて、本実施例におけるピッチパターン表
示では、図１４に示したように、既に音声記号列中で指
定されている点ピッチ位置は白丸で表示され、音声記号
列中で指定されていないが、利用者が指定可能な点ピッ
チ位置は黒丸で表示される。これら白丸並びに黒丸の位
置は、利用者が入力装置６４のマウス等を操作すること
で移動可能である。In the pitch pattern display of this embodiment, as shown in FIG. 14, point pitch positions already specified in the phonetic symbol string are displayed as white circles, and point pitch positions that are not specified in the string but that the user may specify are displayed as black circles. The user can move these white and black circles by operating the mouse or another part of the input device 64.

【０１１１】例えば、マウスを使用した場合であれば、
まず利用者はマウスを移動操作してマウスカーソルを移
動したい点ピッチの丸の上にもっていき、その状態でマ
ウスボタンを押す。この状態で利用者は、マウスを移動
操作して所望の位置にマウスカーソルを移動し、しかる
後にマウスボタンを離す。For example, when a mouse is used, the user first moves the mouse so that the mouse cursor is over the circle of the point pitch to be moved, and presses the mouse button. With the button held down, the user moves the mouse cursor to the desired position and then releases the button.

【０１１２】音声記号列修正処理部６２は、この利用者
のマウス操作に応じて、移動指定された丸の位置（点ピ
ッチの位置）を移動する（ステップＳ４）。そして音声
記号列修正処理部６２は、移動した丸と、その前後の丸
との間の直線を新たに補間する。更に音声記号列修正処
理部６２は、音声記号列ファイル６１から読み出してあ
る音声記号列を、丸の位置の移動（変更）に応じて変更
（修正）する（ステップＳ５）。In response to this mouse operation, the phonetic symbol string correction processing unit 62 moves the circle designated for movement (the point pitch position) (step S4). It then re-interpolates the straight lines between the moved circle and the circles before and after it. Further, it changes (corrects) the phonetic symbol string read from the phonetic symbol string file 61 in accordance with the movement (change) of the circle (step S5).

【０１１３】このようにして、例えば図１５（ａ）に示
す音声記号列に従って図１５（ｂ）のように表示された
ピッチパターンを、白丸位置の移動操作により図１５
（ｃ）のように変更した場合であれば、図１５（ｄ）に
示すような音声記号列に変更される。In this way, if, for example, the pitch pattern displayed as in FIG. 15(b) according to the phonetic symbol string of FIG. 15(a) is changed as in FIG. 15(c) by moving white circles, the phonetic symbol string is changed to that shown in FIG. 15(d).

【０１１４】また音声記号列修正処理部６２は、図１４
において破線で示したように、実音声を分析して得られ
るピッチパターンを参照パターン１４２として、ピッチ
パターン１４１と共に表示する。これにより利用者は、
参照パターン１４２を見ながら、当該参照パターンに最
も近づくように、音声記号列に基づいてグラフ化された
ピッチパターン１４１上の丸の位置（ドッと位置）を修
正するならば、人間の発声した音声に極めて近い抑揚を
持つ音声を合成することが可能な音声記号列を容易に作
成（編集）することができる。As indicated by the broken line in FIG. 14, the phonetic symbol string correction processing unit 62 also displays, together with the pitch pattern 141, a pitch pattern obtained by analyzing actual speech as a reference pattern 142. By watching the reference pattern 142 and correcting the positions of the circles (dots) on the pitch pattern 141, which is graphed from the phonetic symbol string, so that it comes as close as possible to the reference pattern, the user can easily create (edit) a phonetic symbol string from which speech with intonation very close to that of human speech can be synthesized.

【０１１５】さて、利用者操作に従うピッチパターン変
更に応じて修正された音声記号列は、入力装置６４を通
して利用者からのセーブ指示が与えられた場合には（ス
テップＳ６）、音声記号列修正処理部６２によって音声
記号列ファイル６１にセーブされる（ステップＳ７）。When the user gives a save instruction through the input device 64 (step S6), the phonetic symbol string corrected in accordance with the user's changes to the pitch pattern is saved in the phonetic symbol string file 61 by the phonetic symbol string correction processing unit 62 (step S7).

【０１１６】音声記号列修正処理部６２は、音声記号列
をセーブすると、利用者により、音声記号列編集処理の
終了ではなくて、次の音声記号列の読み出しが指定され
ているならば（ステップＳ８，Ｓ９）、ステップＳ１に
戻って次の音声記号列を音声記号列ファイル６１から読
み出し、その音声記号列を対象にステップＳ２以降の処
理を行なう。After saving the phonetic symbol string, if the user has specified reading of the next phonetic symbol string rather than ending the editing process (steps S8, S9), the phonetic symbol string correction processing unit 62 returns to step S1, reads the next phonetic symbol string from the phonetic symbol string file 61, and performs the processing from step S2 onward on that string.

【０１１７】このような音声記号列編集処理は、前記第
１の実施形態における音声記号列編集部３（内の音声記
号列修正処理部３２）においても同様に行なわれる。さ
て、音声記号列編集部６（内の音声記号列修正処理部６
２）による音声記号列編集処理で修正された音声記号列
ファイル６１内の音声記号列、即ち音声記号列生成処理
部５３により自動生成された後に利用者操作に従って修
正された、或いは利用者が作成した後利用者操作に従っ
て修正された音声記号列は、音声合成部５による音声合
成に供される。これにより、人間の発声した音声に極め
て近い抑揚を持つ音声を合成することができる。The same phonetic symbol string editing process is performed in the phonetic symbol string editing unit 3 (by its phonetic symbol string correction processing unit 32) of the first embodiment. The phonetic symbol strings in the phonetic symbol string file 61 that have been corrected by the editing process of the phonetic symbol string editing unit 6 (its phonetic symbol string correction processing unit 62), that is, strings generated automatically by the phonetic symbol string generation processing unit 53 and then corrected by user operation, or strings created by the user and then corrected by user operation, are supplied to the speech synthesis unit 5 for speech synthesis. This makes it possible to synthesize speech with intonation very close to that of human speech.

【０１１８】[0118]

【発明の効果】以上説明したように本発明によれば、設
定する点ピッチの数を最小限に抑えながらも、従来に比
べて細かなピッチ制御を可能にし、より滑らかで人間の
声の抑揚に近いピッチパターンを、テキスト或いは音声
記号列から生成することができる。また、本発明によれ
ば、利用者が、最小限の労力で簡単に、利用者好みのピ
ッチパターンを音声合成に供することができる等の効果
もある。[Effect of the Invention] As described above, according to the present invention, the number of point pitches to be set is kept to a minimum while finer pitch control than before becomes possible, so that a smoother pitch pattern, closer to the intonation of the human voice, can be generated from text or from a phonetic symbol string. The invention also has the effect, among others, that a user can easily supply a pitch pattern of the user's own liking for speech synthesis with minimal effort.

[Brief description of drawings]

【図１】本発明の第１の実施形態に係る音声合成装置の
概略構成を示すブロック図。FIG. 1 is a block diagram showing a schematic configuration of a speech synthesizer according to a first embodiment of the present invention.

【図２】同第１の実施形態における音韻継続時間の決定
方法を説明するための図。FIG. 2 is a diagram for explaining a method of determining a phoneme duration in the first embodiment.

【図３】同第１の実施形態におけるピッチ設定方法を説
明するための図。FIG. 3 is a diagram for explaining a pitch setting method according to the first embodiment.

【図４】同第１の実施形態における音韻種或いは音韻環
境を考慮した点ピッチ位置付与ルール（点ピッチ設定ル
ール）を整理して示す図。FIG. 4 is a diagram summarizing the point pitch position assignment rules (point pitch setting rules) of the first embodiment, which take the phoneme type or phoneme environment into account.

【図５】図４のルールに従って決定される点ピッチ位置
の具体例を４つのアクセント句について示す図。FIG. 5 is a view showing a specific example of a point pitch position determined according to the rule of FIG. 4 for four accent phrases.

【図６】同第１の実施形態における音声記号列の一例を
示す図。FIG. 6 is a diagram showing an example of a phonetic symbol string according to the first embodiment.

【図７】図１中の合成フィルタ処理部２７の構成を示す
ブロック図。FIG. 7 is a block diagram showing the configuration of the synthesis filter processing unit 27 in FIG. 1.

【図８】本発明の第２の実施形態に係る音声合成装置の
概略構成を示すブロック図。FIG. 8 is a block diagram showing a schematic configuration of a speech synthesizer according to a second embodiment of the present invention.

【図９】同第２の実施形態におけるピッチ設定方法を説
明するための図。FIG. 9 is a diagram for explaining a pitch setting method according to the second embodiment.

【図１０】同第２の実施形態における音節種或いは音節
環境を考慮した点ピッチ位置付与ルール（点ピッチ設定
ルール）を整理して示す図。FIG. 10 is a diagram summarizing and showing a point pitch position assigning rule (point pitch setting rule) in consideration of a syllable type or a syllable environment in the second embodiment.

【図１１】図１０のルールに従って決定される点ピッチ
位置の具体例を４つのアクセント句について示す図。FIG. 11 is a diagram showing a specific example of a point pitch position determined according to the rule of FIG. 10 for four accent phrases.

【図１２】同第２の実施形態における音声記号列の一例
を示す図。FIG. 12 is a diagram showing an example of a phonetic symbol string according to the second embodiment.

【図１３】図８中の音声記号列編集部６（内の音声記号
列修正処理部６２）による音声記号列編集処理を説明す
るためのフローチャート。FIG. 13 is a flowchart for explaining the phonetic symbol string editing process performed by the phonetic symbol string editing unit 6 (its phonetic symbol string correction processing unit 62) in FIG. 8.

【図１４】音声記号列編集処理におけるピッチパターン
表示例を示す図。FIG. 14 is a diagram showing an example of pitch pattern display in a voice symbol string editing process.

【図１５】音声記号列編集処理におけるピッチパターン
と音声記号列の変更例を示す図。FIG. 15 is a diagram showing an example of changing a pitch pattern and a phonetic symbol string in a phonetic symbol string editing process.

【図１６】従来のピッチ設定方法を説明するための図。FIG. 16 is a diagram for explaining a conventional pitch setting method.

[Explanation of symbols]

１，４…言語処理部、２，５…音声合成部、３，６…音声記号列編集部、１１，４１…言語解析処理部、２１，５１…音韻継続時間計算処理部、２２…音韻毎点ピッチ設定処理部、２３，５３…音声記号列生成処理部、２４，５４…ピッチパターン生成処理部（韻律パラメー
タ生成処理手段）、２５，５５…音韻パラメータ生成処理部（特徴パラメー
タ生成処理手段）、２６，５６…音声素片メモリ、２７，５７…合成フィルタ処理部、３１，６１…音声記号列ファイル、３２，６２…音声記号列修正処理部、３３，６３…ディスプレイ装置、３４，６４…入力装置、５２…音節毎点ピッチ設定処理部、６１０，１２１０…音韻情報、６２０，１２２０…韻律情報、６２２…音韻継続時間・ピッチ情報、１２２２…音節継続時間・ピッチ情報。1, 4 ... language processing unit; 2, 5 ... speech synthesis unit; 3, 6 ... phonetic symbol string editing unit; 11, 41 ... language analysis processing unit; 21, 51 ... phoneme duration calculation processing unit; 22 ... per-phoneme point pitch setting processing unit; 23, 53 ... phonetic symbol string generation processing unit; 24, 54 ... pitch pattern generation processing unit (prosodic parameter generation processing means); 25, 55 ... phoneme parameter generation processing unit (feature parameter generation processing means); 26, 56 ... speech unit memory; 27, 57 ... synthesis filter processing unit; 31, 61 ... phonetic symbol string file; 32, 62 ... phonetic symbol string correction processing unit; 33, 63 ... display device; 34, 64 ... input device; 52 ... per-syllable point pitch setting processing unit; 610, 1210 ... phoneme information; 620, 1220 ... prosody information; 622 ... phoneme duration and pitch information; 1222 ... syllable duration and pitch information.

Claims

[Claims]

1. A speech synthesis method for synthesizing speech based on a pitch pattern generated by giving a plurality of point pitches based on accent information of a word or a phrase and interpolating between the given point pitches, wherein the position on the time axis at which each point pitch is given is determined based on at least one of the phoneme environment and the phoneme type of the phoneme to which the point pitch is given.

2. A speech synthesizer comprising: a speech unit storage medium that stores speech feature parameters in the form of speech units each consisting of a predetermined synthesis unit; feature parameter generation processing means for reading speech units from the speech unit storage medium based on input phoneme information, connecting the read speech units, and generating the feature parameters of the speech to be synthesized; per-phoneme point pitch setting processing means for setting a point pitch for each phoneme by using an accent parameter input corresponding to the phoneme information, the means determining the position of the point pitch on the time axis based on at least one of the phoneme environment and the phoneme type of the phoneme to which the point pitch being set belongs; pitch pattern generation processing means for generating a pitch pattern of the speech by interpolating between adjacent point pitches set by the per-phoneme point pitch setting processing means; and synthesis filter processing means for synthesizing speech from the feature parameters of the speech to be synthesized generated by the feature parameter generation processing means and the pitch pattern generated by the pitch pattern generation processing means.

3. A speech synthesis method in which a plurality of point pitches are given based on accent information of a word or phrase, and speech is synthesized based on a pitch pattern generated by interpolating between the given point pitches, characterized in that the position on the time axis at which each point pitch is given is determined based on at least one of the syllable environment and the syllable type of the syllable to which the point pitch is given.

4. A speech synthesis apparatus comprising: a speech unit storage medium that stores speech feature parameters in the form of speech units of a predetermined synthesis unit; feature parameter generation processing means for reading speech units from the speech unit storage medium based on input phoneme information, connecting the read speech units, and generating feature parameters of the speech to be synthesized; per-syllable point pitch setting processing means for setting a point pitch for each syllable using an accent parameter input in correspondence with the phoneme information, and for determining the position of each point pitch on the time axis based on at least one of the syllable environment and the syllable type of the syllable to which the set point pitch belongs; pitch pattern generation processing means for generating a pitch pattern of the speech by interpolating between adjacent point pitches set by the per-syllable point pitch setting processing means; and synthesis filter processing means for synthesizing speech from the feature parameters generated by the feature parameter generation processing means and the pitch pattern generated by the pitch pattern generation processing means.

5. A speech synthesis method in which a plurality of point pitches are given based on accent information of a word or phrase, and speech is synthesized based on a pitch pattern generated by interpolating between the given point pitches, characterized in that at least four time points at which a point pitch can be specified are set within the duration of one phoneme, and a point pitch is specified at two or fewer of those time points.

6. A speech synthesis apparatus comprising: a speech unit storage medium that stores speech feature parameters in the form of speech units of a predetermined synthesis unit; feature parameter generation processing means for reading speech units from the speech unit storage medium based on input phoneme information, connecting the read speech units, and generating feature parameters of the speech to be synthesized; per-phoneme point pitch setting processing means for setting a point pitch for each phoneme using an accent parameter input in correspondence with the phoneme information, the means setting at least four time points at which a point pitch can be specified within the duration of one phoneme and specifying a point pitch at two or fewer of those time points; pitch pattern generation processing means for generating a pitch pattern of the speech by interpolating between adjacent point pitches set by the per-phoneme point pitch setting processing means; and synthesis filter processing means for synthesizing speech from the feature parameters generated by the feature parameter generation processing means and the pitch pattern generated by the pitch pattern generation processing means.

7. A speech synthesis method in which a plurality of point pitches are given based on accent information of a word or phrase, and speech is synthesized based on a pitch pattern generated by interpolating between the given point pitches, characterized in that at least four time points at which a point pitch can be specified are set within the duration of one syllable, and a point pitch is specified at two or fewer of those time points.

8. A speech synthesis apparatus comprising: a speech unit storage medium that stores speech feature parameters in the form of speech units of a predetermined synthesis unit; feature parameter generation processing means for reading speech units from the speech unit storage medium based on input phoneme information, connecting the read speech units, and generating feature parameters of the speech to be synthesized; per-syllable point pitch setting processing means for setting a point pitch for each syllable using an accent parameter input in correspondence with the phoneme information, the means setting at least four time points at which a point pitch can be specified within the duration of one syllable and specifying a point pitch at two or fewer of those time points; pitch pattern generation processing means for generating a pitch pattern of the speech by interpolating between adjacent point pitches set by the per-syllable point pitch setting processing means; and synthesis filter processing means for synthesizing speech from the feature parameters generated by the feature parameter generation processing means and the pitch pattern generated by the pitch pattern generation processing means.

9. The speech synthesis apparatus according to claim 2 or claim 8, further comprising language analysis processing means for analyzing text to be subjected to speech synthesis and generating phoneme information and accent information.

10. A speech synthesis method that takes as input a speech symbol string describing phoneme information of speech and prosodic information including pitch information, selects a plurality of speech units, according to the phoneme information in the speech symbol string, from a group of speech units prepared in advance and composed of a plurality of speech feature parameters, connects the selected speech units to generate parameters expressing the phonemes of the speech, generates a pitch pattern of the speech according to the prosodic information in the speech symbol string, and synthesizes speech based on the parameters expressing the phonemes and the pitch pattern, characterized in that the pitch information in the speech symbol string is given as point pitches.

11. A speech synthesis apparatus comprising: a speech unit storage medium that stores a group of speech units prepared in advance and composed of a plurality of speech feature parameters; phoneme parameter generation means for receiving at least the phoneme information in a speech symbol string that describes phoneme information of the speech to be synthesized and prosodic information including pitch information given as point pitches, selecting a plurality of speech units from the speech unit storage medium according to the phoneme information, and connecting them to generate parameters expressing the phonemes of the speech; prosodic parameter generation means for receiving at least the prosodic information in the speech symbol string and generating a pitch pattern of the speech to be synthesized according to the prosodic information; and synthesis filter processing means for synthesizing speech from the parameters expressing the phonemes generated by the phoneme parameter generation means and the pitch pattern generated by the prosodic parameter generation means.

12. A speech synthesis method that takes as input a speech symbol string describing phoneme information of speech and prosodic information including pitch information, selects a plurality of speech units, according to the phoneme information in the speech symbol string, from a group of speech units prepared in advance and composed of a plurality of speech feature parameters, connects the selected speech units to generate parameters expressing the phonemes of the speech, generates a pitch pattern of the speech according to the prosodic information in the speech symbol string, and synthesizes speech based on the parameters expressing the phonemes and the pitch pattern, characterized in that the pitch information in the speech symbol string is given as point pitches, and the position of each point pitch on the time axis is specified with reference to the start point of each phoneme.

13. A speech synthesis apparatus comprising: a speech unit storage medium that stores a group of speech units prepared in advance and composed of a plurality of speech feature parameters; phoneme parameter generation means for receiving at least the phoneme information in a speech symbol string that describes phoneme information of the speech to be synthesized and prosodic information including pitch information given as point pitches whose positions on the time axis are specified with reference to the start point of each phoneme, selecting a plurality of speech units from the speech unit storage medium according to the phoneme information, and connecting them to generate parameters expressing the phonemes of the speech; prosodic parameter generation means for receiving at least the prosodic information in the speech symbol string and generating a pitch pattern of the speech to be synthesized according to the prosodic information; and synthesis filter processing means for synthesizing speech from the parameters expressing the phonemes generated by the phoneme parameter generation means and the pitch pattern generated by the prosodic parameter generation means.

14. A speech synthesis method that takes as input a speech symbol string describing phoneme information of speech and prosodic information including pitch information, selects a plurality of speech units, according to the phoneme information in the speech symbol string, from a group of speech units prepared in advance and composed of a plurality of speech feature parameters, connects the selected speech units to generate parameters expressing the phonemes of the speech, generates a pitch pattern of the speech according to the prosodic information in the speech symbol string, and synthesizes speech based on the parameters expressing the phonemes and the pitch pattern, characterized in that the pitch information in the speech symbol string is given as point pitches, and the position of each point pitch on the time axis is specified with reference to the start point of each syllable.

15. A speech synthesis apparatus comprising: a speech unit storage medium that stores a group of speech units prepared in advance and composed of a plurality of speech feature parameters; phoneme parameter generation means for receiving at least the phoneme information in a speech symbol string that describes phoneme information of the speech to be synthesized and prosodic information including pitch information given as point pitches whose positions on the time axis are specified with reference to the start point of each syllable, selecting a plurality of speech units from the speech unit storage medium according to the phoneme information, and connecting them to generate parameters expressing the phonemes of the speech; prosodic parameter generation means for receiving at least the prosodic information in the speech symbol string and generating a pitch pattern of the speech to be synthesized according to the prosodic information; and synthesis filter processing means for synthesizing speech from the parameters expressing the phonemes generated by the phoneme parameter generation means and the pitch pattern generated by the prosodic parameter generation means.

16. A speech synthesis method that takes as input a speech symbol string describing phoneme information and prosodic information of speech, selects a plurality of speech units, according to the phoneme information in the speech symbol string, from a group of speech units prepared in advance and composed of a plurality of speech feature parameters, connects the selected speech units to generate parameters expressing the phonemes of the speech, generates a pitch pattern of the speech according to the prosodic information in the speech symbol string, and synthesizes speech based on the parameters expressing the phonemes and the pitch pattern, characterized in that at least four time points at which a point pitch can be specified are set within the duration of each phoneme included in the phoneme information in the speech symbol string, and the prosodic information in the speech symbol string is described by specifying a point pitch at two or fewer of those time points.

17. A speech synthesis apparatus comprising: a speech unit storage medium that stores a group of speech units prepared in advance and composed of a plurality of speech feature parameters; phoneme parameter generation means for receiving at least the phoneme information in a speech symbol string that describes phoneme information of the speech to be synthesized and prosodic information in which at least four time points at which a point pitch can be specified are set within the duration of each phoneme included in the phoneme information and a point pitch is specified at two or fewer of those time points, selecting a plurality of speech units from the speech unit storage medium according to the phoneme information, and connecting them to generate parameters expressing the phonemes of the speech; prosodic parameter generation means for receiving at least the prosodic information in the speech symbol string and generating a pitch pattern of the speech to be synthesized according to the prosodic information; and synthesis filter processing means for synthesizing speech from the parameters expressing the phonemes generated by the phoneme parameter generation means and the pitch pattern generated by the prosodic parameter generation means.

18. A speech synthesis method that takes as input a speech symbol string describing phoneme information and prosodic information of speech, selects a plurality of speech units, according to the phoneme information in the speech symbol string, from a group of speech units prepared in advance and composed of a plurality of speech feature parameters, connects the selected speech units to generate parameters expressing the phonemes of the speech, generates a pitch pattern of the speech according to the prosodic information in the speech symbol string, and synthesizes speech based on the parameters expressing the phonemes and the pitch pattern, characterized in that at least four time points at which a point pitch can be specified are set within the duration of each phoneme included in the phoneme information in the speech symbol string, and the prosodic information in the speech symbol string is described by specifying, for two or fewer of those time points, the position of the point pitch on the time axis with reference to the start point of each phoneme.

19. A speech synthesis apparatus comprising: a speech unit storage medium that stores a group of speech units prepared in advance and composed of a plurality of speech feature parameters; phoneme parameter generation means for receiving at least the phoneme information in a speech symbol string that describes phoneme information of the speech to be synthesized and prosodic information in which at least four time points at which a point pitch can be specified are set within the duration of each phoneme included in the phoneme information and, for two or fewer of those time points, the position of the point pitch on the time axis is specified with reference to the start point of each phoneme, selecting a plurality of speech units from the speech unit storage medium according to the phoneme information, and connecting them to generate parameters expressing the phonemes of the speech; prosodic parameter generation means for receiving at least the prosodic information in the speech symbol string and generating a pitch pattern of the speech to be synthesized according to the prosodic information; and synthesis filter processing means for synthesizing speech from the parameters expressing the phonemes generated by the phoneme parameter generation means and the pitch pattern generated by the prosodic parameter generation means.

20. A speech synthesis method that takes as input a speech symbol string describing phoneme information and prosodic information of speech, selects a plurality of speech units, according to the phoneme information in the speech symbol string, from a group of speech units prepared in advance and composed of a plurality of speech feature parameters, connects the selected speech units to generate parameters expressing the phonemes of the speech, generates a pitch pattern of the speech according to the prosodic information in the speech symbol string, and synthesizes speech based on the parameters expressing the phonemes and the pitch pattern, characterized in that at least four time points at which a point pitch can be specified are set within the duration of each syllable included in the phoneme information in the speech symbol string, and the prosodic information in the speech symbol string is described by specifying a point pitch at two or fewer of those time points.

21. A speech synthesis apparatus comprising: a speech unit storage medium that stores a group of speech units prepared in advance and composed of a plurality of speech feature parameters; phoneme parameter generation means for receiving at least the phoneme information in a speech symbol string that describes phoneme information of the speech to be synthesized and prosodic information in which at least four time points at which a point pitch can be specified are set within the duration of each syllable included in the phoneme information and a point pitch is specified at two or fewer of those time points, selecting a plurality of speech units from the speech unit storage medium according to the phoneme information, and connecting them to generate parameters expressing the phonemes of the speech; prosodic parameter generation means for receiving at least the prosodic information in the speech symbol string and generating a pitch pattern of the speech to be synthesized according to the prosodic information; and synthesis filter processing means for synthesizing speech from the parameters expressing the phonemes generated by the phoneme parameter generation means and the pitch pattern generated by the prosodic parameter generation means.

22. A speech synthesis method that takes as input a speech symbol string describing phoneme information and prosodic information of speech, selects a plurality of speech units, according to the phoneme information in the speech symbol string, from a group of speech units prepared in advance and composed of a plurality of speech feature parameters, connects the selected speech units to generate parameters expressing the phonemes of the speech, generates a pitch pattern of the speech according to the prosodic information in the speech symbol string, and synthesizes speech based on the parameters expressing the phonemes and the pitch pattern, characterized in that at least four time points at which a point pitch can be specified are set within the duration of each syllable included in the phoneme information in the speech symbol string, and the prosodic information in the speech symbol string is described by specifying, for two or fewer of those time points, the position of the point pitch on the time axis with reference to the start point of each syllable.

23. A speech synthesis apparatus comprising: a speech unit storage medium that stores a group of speech units prepared in advance and composed of a plurality of speech feature parameters; phoneme parameter generation means for receiving at least the phoneme information in a speech symbol string that describes phoneme information of the speech to be synthesized and prosodic information in which at least four time points at which a point pitch can be specified are set within the duration of each syllable included in the phoneme information and, for two or fewer of those time points, the position of the point pitch on the time axis is specified with reference to the start point of each syllable, selecting a plurality of speech units from the speech unit storage medium according to the phoneme information, and connecting them to generate parameters expressing the phonemes of the speech; prosodic parameter generation means for receiving at least the prosodic information in the speech symbol string and generating a pitch pattern of the speech to be synthesized according to the prosodic information; and synthesis filter processing means for synthesizing speech from the parameters expressing the phonemes generated by the phoneme parameter generation means and the pitch pattern generated by the prosodic parameter generation means.

24. A speech symbol string editing device applied to a speech synthesis apparatus that takes as input a speech symbol string describing phoneme information of speech and prosodic information including pitch information given as point pitches, selects a plurality of speech units, according to the phoneme information in the speech symbol string, from a group of speech units prepared in advance and composed of a plurality of speech feature parameters, connects the selected speech units to generate parameters expressing the phonemes of the speech, generates a pitch pattern of the speech according to the prosodic information in the speech symbol string, and synthesizes speech based on the parameters expressing the phonemes and the pitch pattern, the editing device comprising: display means for displaying a plurality of point pitches according to the pitch information in the speech symbol string; input means for inputting an instruction to change a point pitch displayed by the display means; and speech symbol string correction processing means for changing the corresponding point pitch displayed by the display means in response to the change instruction from the input means and reflecting the result of the change in the speech symbol string.

25. A speech symbol string editing device applied to a speech synthesis apparatus that takes as input a speech symbol string describing phoneme information of speech and prosodic information including pitch information given as point pitches, selects a plurality of speech units, according to the phoneme information in the speech symbol string, from a group of speech units prepared in advance and composed of a plurality of speech feature parameters, connects the selected speech units to generate parameters expressing the phonemes of the speech, generates a pitch pattern of the speech according to the prosodic information in the speech symbol string, and synthesizes speech based on the parameters expressing the phonemes and the pitch pattern, the editing device comprising: display means for displaying, as a graph, a plurality of point pitches according to the pitch information in the speech symbol string; input means for inputting an instruction to move the position of a point pitch displayed by the display means; and speech symbol string correction processing means for moving the position of the corresponding point pitch displayed by the display means on the screen in response to the movement instruction from the input means and reflecting the new position of the point pitch after the movement in the speech symbol string.

26. A speech symbol string editing device applied to a speech synthesis apparatus that takes as input a speech symbol string describing phoneme information of speech and prosodic information including pitch information given as point pitches, selects a plurality of speech units, according to the phoneme information in the speech symbol string, from a group of speech units prepared in advance and composed of a plurality of speech feature parameters, connects the selected speech units to generate parameters expressing the phonemes of the speech, generates a pitch pattern of the speech according to the prosodic information in the speech symbol string, and synthesizes speech based on the parameters expressing the phonemes and the pitch pattern, the editing device comprising: display means for displaying, as a graph, a plurality of point pitches according to the pitch information in the speech symbol string together with the pitch contour obtained by interpolating between those point pitches; input means for inputting an instruction to move the position of a point pitch displayed by the display means; and speech symbol string correction processing means for moving the position of the corresponding point pitch displayed by the display means on the screen in response to the movement instruction from the input means and reflecting the new position of the point pitch after the movement in the speech symbol string.

27. The speech symbol string editing device according to claim 26, wherein a pitch contour obtained by analyzing reference speech corresponding to the phoneme information in the speech symbol string is displayed as a graph on the display means as a reference pattern.
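
The pitch pattern scheme recited in the claims (at least four specifiable time points per phoneme or syllable, point pitches at two or fewer of them, positions measured from the start point of the unit, and interpolation between adjacent point pitches) can be illustrated with a minimal Python sketch. It is not the implementation of the invention. The names `Unit`, `collect_point_pitches` and `pitch_at` are hypothetical, the specifiable time points are assumed to lie at the start of each quarter of the unit's duration, and linear interpolation is assumed because the claims say only "interpolating".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

CANDIDATES_PER_UNIT = 4   # assumed: exactly four specifiable time points
MAX_POINTS_PER_UNIT = 2   # point pitches at two or fewer of them


@dataclass
class Unit:
    """One phoneme or syllable (hypothetical container for this sketch)."""
    symbol: str
    duration_ms: float
    # candidate index (0..3) -> pitch in Hz; at most two entries
    point_pitches: Dict[int, float] = field(default_factory=dict)


def collect_point_pitches(units: List[Unit]) -> List[Tuple[float, float]]:
    """Return (time_ms, pitch_hz) pairs on the utterance time axis.

    Each position is computed relative to the start point of its unit,
    assuming candidates at 0, 1/4, 2/4 and 3/4 of the unit's duration.
    """
    points: List[Tuple[float, float]] = []
    start_ms = 0.0
    for unit in units:
        if unit.duration_ms <= 0:
            raise ValueError(f"{unit.symbol}: duration must be positive")
        if len(unit.point_pitches) > MAX_POINTS_PER_UNIT:
            raise ValueError(f"{unit.symbol}: more than two point pitches")
        for index in sorted(unit.point_pitches):
            if not 0 <= index < CANDIDATES_PER_UNIT:
                raise ValueError(f"{unit.symbol}: bad candidate index {index}")
            offset_ms = unit.duration_ms * index / CANDIDATES_PER_UNIT
            points.append((start_ms + offset_ms, unit.point_pitches[index]))
        start_ms += unit.duration_ms
    return points


def pitch_at(points: List[Tuple[float, float]], t_ms: float) -> float:
    """Linearly interpolate the pitch at t_ms between adjacent point pitches.

    Outside the first and last point pitch the value is held constant,
    which is a simplification chosen for this sketch.
    """
    if not points:
        raise ValueError("no point pitches given")
    if t_ms <= points[0][0]:
        return points[0][1]
    if t_ms >= points[-1][0]:
        return points[-1][1]
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if t0 <= t_ms <= t1:
            return f0 + (f1 - f0) * (t_ms - t0) / (t1 - t0)
    raise AssertionError("unreachable when points are sorted by time")


# Usage: three units, one of them carrying no point pitch at all.
units = [
    Unit("a", 100.0, {0: 120.0}),   # point pitch at the start of "a"
    Unit("k", 80.0),                # no point pitch; contour is interpolated
    Unit("i", 120.0, {2: 160.0}),   # point pitch at the midpoint of "i"
]
points = collect_point_pitches(units)
print(points)                   # [(0.0, 120.0), (240.0, 160.0)]
print(pitch_at(points, 120.0))  # 140.0
```

Because positions are stored as a candidate index within a unit rather than as absolute times, changing a unit's duration shifts its point pitches automatically. Under the same assumptions, an editing operation of the kind recited in claims 24 to 26 would amount to changing an index or pitch value in `point_pitches` and recomputing the contour.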