JPH1195798A - Method and device for voice synthesis

Info

Publication number: JPH1195798A
Application number: JP9273950A
Authority: JP
Inventors: Toshio Motegi; 敏雄茂出木
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 1997-09-19
Filing date: 1997-09-19
Publication date: 1999-04-09

Abstract

PROBLEM TO BE SOLVED: To synthesize human voices efficiently and with freedom in sound length and pitch. SOLUTION: Phonemes such as the basic fifty sounds, voiced sounds, and contracted sounds are uttered by an announcer at a reference sound length and pitch, and the sound waveforms are captured as PCM data. A plurality of unit segments d1, d2, ... is defined on the time axis of the sound waveform of one phoneme, the waveform is Fourier-transformed for each segment, four representative frequencies are extracted, and four corresponding MIDI data items are defined. The four MIDI data items are arranged on four tracks T1 to T4. The MIDI data of segments d1 to d9 are arranged at positions k1 to k9 and classified into data that contribute to vowels and data that contribute to consonants. When a specified phoneme is reproduced with a specified sound length and pitch, the sound length (delta time) and pitch (note number) of the MIDI data that contribute to vowels are corrected before reproduction.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech synthesis method and a speech synthesis apparatus, and in particular to technology for synthesizing human speech and singing voices using a computer.

[0002]

2. Description of the Related Art

Techniques for artificially synthesizing human speech are used in various fields, and various electronic devices with built-in computers have come to have a simulated conversation function. The most widespread method of encoding human speech is PCM (Pulse Code Modulation), which is widely used in fields that digitize and handle audio signals. The basic principle of PCM is to sample an analog audio signal at a predetermined sampling frequency and to quantize the signal strength at each sampling instant and represent it as digital data. The higher the sampling frequency and the number of quantization bits, the more faithfully the original sound can be reproduced. However, the higher the sampling frequency and the number of quantization bits, the larger the amount of information required. ADPCM (Adaptive Differential Pulse Code Modulation), which encodes only the differences between successive signal values, is therefore also used as a way of reducing the amount of information as far as possible.
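The PCM principle described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the patent: the function names, the 8 kHz rate, and the 8-bit depth are assumptions chosen for the example, and `delta_encode` shows only plain difference coding, not the adaptive step size that real ADPCM adds.

```python
import math

def pcm_encode(signal, bits):
    """Quantize samples in [-1.0, 1.0] to signed integers of the given bit depth."""
    max_level = 2 ** (bits - 1) - 1
    return [round(max(-1.0, min(1.0, s)) * max_level) for s in signal]

def delta_encode(codes):
    """Keep the first sample, then only the sample-to-sample differences."""
    return [codes[0]] + [b - a for a, b in zip(codes, codes[1:])]

def delta_decode(deltas):
    """Rebuild the PCM codes by accumulating the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Illustrative values only: a 440 Hz tone sampled at 8 kHz, 8-bit quantization.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(16)]
codes = pcm_encode(tone, bits=8)
deltas = delta_encode(codes)
```

Raising `bits` or `sample_rate` improves fidelity and increases the data volume in the same proportion, which is the trade-off the paragraph describes.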

[0003] When an electronic device is given a simulated conversation function, this PCM technique is normally used. For example, if human speech is encoded word by word with PCM and recorded, and the required words are decoded when needed, specific words can be reproduced. Reproducing several words in a predetermined order makes a simulated conversation possible.

[0004] Meanwhile, the MIDI (Musical Instrument Digital Interface) standard, which grew out of the idea of encoding the sounds of electronic musical instruments, has come into widespread use with the spread of personal computers. Code data under the MIDI standard (hereinafter, MIDI data) is basically data describing the operations of an instrumental performance, namely which key of the instrument was played and how hard; the MIDI data itself contains no actual sound waveform. To reproduce actual sound, a separate MIDI sound source that stores the waveforms of instrument sounds is therefore required. Nevertheless, compared with recording sound by the PCM technique described above, MIDI requires an extremely small amount of information, and its high coding efficiency is attracting attention. Encoding and decoding technology based on the MIDI standard is now widely adopted in software for performing, practicing, and composing music on personal computers, and is also widely used in fields such as karaoke and game sound effects.
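To make the size difference concrete, the sketch below builds the three-byte MIDI Note On and Note Off channel messages and compares them with the size of one second of PCM audio. The helper names and the 22,050 Hz, 16-bit PCM figures are assumptions for illustration; only the message layout (status byte, note number, velocity) comes from the MIDI standard.

```python
def note_on(channel, note, velocity):
    """Build a 3-byte MIDI Note On message: status 0x90 | channel, note, velocity."""
    if not (0 <= channel <= 15 and 0 <= note <= 127 and 0 <= velocity <= 127):
        raise ValueError("channel must be 0-15; note and velocity must be 0-127")
    return bytes([0x90 | channel, note, velocity])

def note_off(channel, note):
    """Build a 3-byte MIDI Note Off message: status 0x80 | channel, note, 0."""
    if not (0 <= channel <= 15 and 0 <= note <= 127):
        raise ValueError("channel must be 0-15; note must be 0-127")
    return bytes([0x80 | channel, note, 0])

# One note described as "which key, how hard": six bytes in total.
midi_bytes = len(note_on(0, 69, 100)) + len(note_off(0, 69))

# Assumed PCM format for comparison: 1 second, 22,050 Hz, 16-bit mono.
pcm_bytes = 22050 * 2
```

The messages carry no waveform at all, which is why a separate sound source is needed to turn them into audio.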

[0005]

[Problems to Be Solved by the Invention]

With the PCM technique described above, human speech can be synthesized fairly faithfully. However, because PCM-encoded data carries the waveform information of the speech as is, the amount of information is considerable and the data-processing load is inevitably heavy. Moreover, data-processing techniques for changing the sound length or pitch of PCM-encoded speech data have not yet been established at a practical level, so it is difficult to reproduce speech while freely changing its length and pitch. As a result, speech reproduced by conventional speech synthesis methods is inevitably monotonous, delivered in a flat reading style without accent or intonation. Singing voices and the like therefore cannot be synthesized with conventional PCM techniques.

[0006] With encoding based on the MIDI standard, on the other hand, sound of sufficient quality can be reproduced from a very small amount of information, and the sound length and pitch can be changed freely. MIDI encoding is therefore widely used in fields that handle instrument sounds. However, as described above, the MIDI standard was originally intended for encoding the operations of an instrumental performance, so it is difficult to synthesize human speech or singing voices with conventional MIDI encoding techniques.

[0007] An object of the present invention is therefore to provide a speech synthesis method and a speech synthesis apparatus that can synthesize human speech efficiently and with freedom in sound length and pitch.

[0008]

[Means for Solving the Problems]

(1) A first aspect of the present invention is a method of synthesizing a human voice corresponding to a specific phoneme. A code definition process is performed for each required phoneme: the waveform of a human utterance of one phoneme is captured as data, a plurality of unit sections is set on the time axis of the utterance waveform, and, for each unit section, codes indicating the representative frequencies contained in the utterance waveform within that section and their intensities are created, thereby defining a code group corresponding to the phoneme. A phoneme code table recording the code group defined for each phoneme is prepared. When a synthesis instruction to synthesize a human voice corresponding to a specific phoneme is given, the code group defined for that phoneme is extracted by referring to the phoneme code table and is reproduced using a predetermined sound source, thereby synthesizing a human voice.
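One way to picture the phoneme code table of the first aspect is as a mapping from a phoneme label to its code group. The sketch below is only an illustration under assumptions: the `Code` fields, the table contents, and the `play` callback standing in for the sound source are all hypothetical, and the numeric values are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Code:
    note: int       # representative frequency, expressed as a note number
    intensity: int  # strength of that frequency component
    section: int    # index of the unit section on the time axis

# Hypothetical table: phoneme label -> code group (values are invented).
PHONEME_TABLE = {
    "sa": [Code(84, 90, 0), Code(60, 70, 0), Code(62, 100, 1), Code(60, 95, 2)],
    "a": [Code(62, 100, 0), Code(60, 95, 1)],
}

def synthesize(phoneme, play):
    """Look up the code group for a phoneme and pass each code to a sound source."""
    for code in PHONEME_TABLE[phoneme]:
        play(code)

# Usage: collect the codes instead of sounding them.
played = []
synthesize("a", played.append)
```

The table is built once in the preparation stage; synthesis afterwards is a lookup followed by playback, which is what keeps the runtime data volume small.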

[0009] (2) A second aspect of the present invention is the speech synthesis method according to the first aspect, in which, when the plurality of unit sections is set on the time axis of the utterance waveform, adjacent unit sections are set so as to overlap partially on the time axis.

[0010] (3) A third aspect of the present invention is the speech synthesis method according to the first or second aspect, in which P codes are created by defining a plurality P of representative frequencies for each unit section, these codes are stored separately in P tracks, and, when the codes are reproduced, the codes stored in the P tracks are reproduced simultaneously to obtain P channels of reproduced sound.

[0011] (4) A fourth aspect of the present invention is the speech synthesis method according to any of the first to third aspects, in which the code group defined for each phoneme is recorded in the phoneme code table classified into codes that contribute to the vowel and codes that contribute to the consonant, the synthesis instruction for a specific phoneme includes a parameter specifying the sound length with which that phoneme is to be uttered, and, when reproduction is performed in accordance with a given synthesis instruction, the sound length of those codes extracted from the phoneme code table that contribute to the vowel is corrected on the basis of the parameter specifying the sound length.

[0012] (5) A fifth aspect of the present invention is the speech synthesis method according to any of the first to third aspects, in which the code group defined for each phoneme is recorded in the phoneme code table classified into codes that contribute to the vowel and codes that contribute to the consonant, the synthesis instruction for a specific phoneme includes a parameter specifying the pitch with which that phoneme is to be uttered, and, when reproduction is performed in accordance with a given synthesis instruction, the pitch of those codes extracted from the phoneme code table that contribute to the vowel is corrected on the basis of the parameter specifying the pitch.
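The sound-length and pitch corrections of the fourth and fifth aspects can be sketched together. The patent does not fix the arithmetic here, so the following is an assumption for illustration: length is corrected by a scale factor, pitch by a note-number offset, and the `Code` type and `correct` function are hypothetical names.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Code:
    note: int      # pitch as a note number (0-127)
    velocity: int  # intensity
    duration: int  # sound length in ticks
    vowel: bool    # True if the code contributes to the vowel

def correct(codes, length_scale=1.0, pitch_shift=0):
    """Apply length and pitch corrections to vowel codes only."""
    out = []
    for c in codes:
        if c.vowel:
            c = replace(
                c,
                note=max(0, min(127, c.note + pitch_shift)),
                duration=max(1, round(c.duration * length_scale)),
            )
        out.append(c)  # consonant codes pass through unchanged
    return out

codes = [Code(84, 90, 10, vowel=False), Code(60, 100, 10, vowel=True)]
corrected = correct(codes, length_scale=2.0, pitch_shift=3)
```

Leaving the consonant codes untouched reflects the claim language: only the codes that contribute to the vowel are corrected.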

[0013] (6) A sixth aspect of the present invention is the speech synthesis method according to the fourth or fifth aspect, in which P codes are created by defining a plurality P of representative frequencies for each unit section, and these codes are sorted by frequency and stored separately in P tracks. For a code group obtained for a phoneme consisting only of a vowel, all codes stored in the P tracks are treated as codes that contribute to the vowel. For a code group obtained for a phoneme consisting only of a consonant, all codes stored in the P tracks are treated as codes that contribute to the consonant. For a code group obtained for a phoneme containing both a vowel and a consonant, the codes stored in the n tracks on the low-frequency side are treated as codes that contribute to the vowel, and the codes stored in the remaining (P - n) tracks are treated as codes that contribute to the consonant.
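The track-based classification rule of the sixth aspect reduces to a small function. The sketch below assumes that tracks are indexed from the highest-frequency track (index 0) to the lowest; the function name, the `"mixed"` label, and the default of n = 2 are assumptions for illustration.

```python
def classify_tracks(num_tracks, phoneme_kind, n_vowel_tracks=2):
    """Return a 'vowel' or 'consonant' label per track.

    Index 0 is assumed to be the highest-frequency track and the last
    index the lowest-frequency track.
    """
    if phoneme_kind == "vowel":
        return ["vowel"] * num_tracks
    if phoneme_kind == "consonant":
        return ["consonant"] * num_tracks
    if phoneme_kind == "mixed":
        if not 0 <= n_vowel_tracks <= num_tracks:
            raise ValueError("n_vowel_tracks must be between 0 and num_tracks")
        n = n_vowel_tracks
        # The n low-frequency tracks carry the vowel; the rest carry the consonant.
        return ["consonant"] * (num_tracks - n) + ["vowel"] * n
    raise ValueError("phoneme_kind must be 'vowel', 'consonant', or 'mixed'")

labels = classify_tracks(4, "mixed", n_vowel_tracks=2)
```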

[0014] (7) A seventh aspect of the present invention is the speech synthesis method according to the fourth or fifth aspect, in which, for a code group obtained for a phoneme consisting only of a vowel, all codes are treated as codes that contribute to the vowel, and, for a code group obtained for a phoneme consisting only of a consonant, all codes are treated as codes that contribute to the consonant. For a phoneme containing both a vowel and a consonant, a plurality of utterance waveforms is captured by uttering the phoneme with different sound lengths, a code group is created for each of these waveforms, and, among the resulting code groups, codes that show only changes within a predetermined tolerance are treated as codes that contribute to the consonant, while the remaining codes are treated as codes that contribute to the vowel.

[0015] (8) An eighth aspect of the present invention is the speech synthesis method according to the fourth or fifth aspect, in which, for a code group obtained for a phoneme consisting only of a vowel, all codes are treated as codes that contribute to the vowel, and, for a code group obtained for a phoneme consisting only of a consonant, all codes are treated as codes that contribute to the consonant. For a phoneme containing both a vowel and a consonant, a plurality of utterance waveforms is captured by uttering the phoneme at different pitches, a code group is created for each of these waveforms, and, among the resulting code groups, codes that show only changes within a predetermined tolerance are treated as codes that contribute to the consonant, while the remaining codes are treated as codes that contribute to the vowel.
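The variation-based classification of the eighth aspect can be sketched as follows. The text does not say how "change" is measured or how codes from different recordings are matched, so this example makes two assumptions: codes are compared slot by slot (same position in each recording's code group), and change is the spread of note numbers across recordings. The function name and tolerance are hypothetical.

```python
def classify_by_pitch_variation(recordings, tolerance=1):
    """Label each code slot as 'consonant' or 'vowel'.

    recordings: equal-length lists of note numbers, one list per utterance
    of the same phoneme at a different pitch.
    """
    labels = []
    for notes in zip(*recordings):
        spread = max(notes) - min(notes)
        # Codes that barely move when the speaker changes pitch are taken
        # to belong to the consonant; codes that follow the pitch, to the vowel.
        labels.append("consonant" if spread <= tolerance else "vowel")
    return labels

# Three made-up recordings of one phoneme, three code slots each.
recordings = [[84, 60, 64], [84, 62, 66], [85, 65, 69]]
labels = classify_by_pitch_variation(recordings)
```

The same comparison could be applied to sound length for the seventh aspect, although aligning codes across recordings of different lengths would need an additional rule that the text does not specify.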

[0016] (9) A ninth aspect of the present invention is the speech synthesis method according to any of the first to eighth aspects, in which code data in MIDI format is created by expressing each representative frequency contained in the utterance waveform within a unit section as a note number, the intensity of that representative frequency as a velocity, and the time corresponding to the length of the unit section as a delta time, and in which a MIDI sound source having acoustic waveforms based on the human voice is used when the MIDI-format code data is reproduced.
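The three mappings named in the ninth aspect (frequency to note number, intensity to velocity, section length to delta time) might look like this in Python. The equal-tempered conversion with note 69 at 440 Hz is the usual MIDI convention; the linear velocity scaling, the 480 ticks per quarter note, and the 500,000 microseconds per quarter note are assumptions picked for the example, not values from the patent.

```python
import math

def freq_to_note(freq_hz):
    """Nearest MIDI note number for a frequency (note 69 = 440 Hz)."""
    return max(0, min(127, round(69 + 12 * math.log2(freq_hz / 440.0))))

def intensity_to_velocity(e, e_max):
    """Scale an intensity linearly to 1-127 (velocity 0 would mean note off)."""
    return max(1, min(127, round(127 * e / e_max)))

def seconds_to_ticks(seconds, ticks_per_quarter=480, tempo_us_per_quarter=500_000):
    """Convert a duration to delta-time ticks for the given tempo and resolution."""
    return round(seconds * 1_000_000 / tempo_us_per_quarter * ticks_per_quarter)

note = freq_to_note(440.0)
velocity = intensity_to_velocity(1.0, 1.0)
ticks = seconds_to_ticks(0.5)
```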

[0017] (10) A tenth aspect of the present invention is the speech synthesis method according to any of the first to ninth aspects, in which, when a plurality of adjacent unit sections contains codes that are similar to one another under predetermined conditions, these similar codes are replaced by a single integrated code spanning the plurality of unit sections.
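The integration step of the tenth aspect can be sketched as a single pass over one track. The "predetermined conditions" are not given at this point in the text, so the similarity test below (same note number, velocity within a tolerance) and the choice to keep the larger velocity are assumptions for illustration.

```python
def merge_similar(codes, note_tol=0, velocity_tol=8):
    """Merge runs of similar codes on one track into integrated codes.

    codes: list of (note, velocity, duration) tuples for consecutive
    unit sections on a single track.
    """
    merged = []
    for note, velocity, duration in codes:
        if merged:
            m_note, m_vel, m_dur = merged[-1]
            if abs(note - m_note) <= note_tol and abs(velocity - m_vel) <= velocity_tol:
                # Extend the previous code instead of emitting a new one.
                merged[-1] = (m_note, max(m_vel, velocity), m_dur + duration)
                continue
        merged.append((note, velocity, duration))
    return merged

track = [(60, 100, 10), (60, 96, 10), (62, 90, 10)]
result = merge_similar(track)
```

Merging shortens the code group without changing its total duration, which further reduces the amount of data to store per phoneme.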

[0018] (11) An eleventh aspect of the present invention is a speech synthesis apparatus for synthesizing a human voice corresponding to a specific phoneme, comprising: a data input unit for inputting synthesis instruction data that specifies the phoneme to be synthesized and its sound length; a phoneme code table in which, for each required phoneme, a code group for reproducing the waveform of a human utterance made at a reference sound length is recorded, classified into codes that contribute to the vowel and codes that contribute to the consonant; a code correction unit that, on the basis of the synthesis instruction data, reads the code group for the phoneme to be synthesized from the phoneme code table and corrects the sound length of those codes in the group that contribute to the vowel; and a sound reproduction unit that reproduces the corrected code group using a predetermined sound source to produce a synthesized human voice.

[0019] (12) A twelfth aspect of the present invention is a speech synthesis apparatus for synthesizing a human voice corresponding to a specific phoneme, comprising: a data input unit for inputting synthesis instruction data that specifies the phoneme to be synthesized and its pitch; a phoneme code table in which, for each required phoneme, a code group for reproducing the waveform of a human utterance made at a reference pitch is recorded, classified into codes that contribute to the vowel and codes that contribute to the consonant; a code correction unit that, on the basis of the synthesis instruction data, reads the code group for the phoneme to be synthesized from the phoneme code table and corrects the pitch of those codes in the group that contribute to the vowel; and a sound reproduction unit that reproduces the corrected code group using a predetermined sound source to produce a synthesized human voice.

[0020]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is described below on the basis of the illustrated embodiment. This specification first describes the procedure of the first stage of the embodiment with reference to FIGS. 1 to 4, then the procedure of the second stage with reference to FIGS. 5 and 6, and finally the configuration of the speech synthesis apparatus according to the invention with reference to FIG. 7.

[0021] §1. Procedure of the First Stage of the Invention

FIG. 1 shows the basic concept of the phoneme encoding process that forms the first stage of the speech synthesis method according to the invention. The first stage is a process for preparing the phoneme code table and corresponds, so to speak, to the preparation for speech synthesis. The phoneme code table records a code group for each of a large number of phonemes; in the second stage, described later, synthesized sound is reproduced on the basis of the codes recorded in this table. In this application, a "phoneme" means a sound that is recognized as a single unit when uttered; in Japanese it corresponds roughly to the pronunciation of one phonetic character.

[0022] In the first stage described here, the human voice is synthesized phoneme by phoneme. First, the waveform of a human utterance of one phoneme is captured as data. For example, if a person pronounces the character "サ" (sa) and the utterance is captured as data, an utterance waveform W such as that shown in FIG. 1(a) is obtained. In the embodiment described here, the utterance waveform W is sampled at a predetermined period using ordinary PCM and taken into a computer as digital data.

[0023] Next, a plurality of unit sections is set on the time axis of the utterance waveform W. FIG. 1(b) shows an example in which unit sections d1, d2, d3, ... are set along the time axis t of the utterance waveform W shown in FIG. 1(a). When the unit sections are set, adjacent sections should preferably overlap partially on the time axis, as in the illustrated example. In this example, the utterance waveform is captured as digital data by sampling at 22 kHz, the length of each unit section is set to 1024 samples (about 47 msec), and the sections are set by shifting the start of each successive section by 20 samples (about 0.9 msec).
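The overlapping unit sections can be generated with a few lines of Python. The section length of 1024 samples and the shift of 20 samples come from the text; the exact rate of 22,050 Hz behind "22 kHz" and the function name are assumptions.

```python
def unit_sections(num_samples, section_len=1024, hop=20):
    """Return (start, end) sample indices of overlapping unit sections."""
    return [(start, start + section_len)
            for start in range(0, num_samples - section_len + 1, hop)]

SAMPLE_RATE = 22050  # assumed exact value of the "22 kHz" in the text

sections = unit_sections(SAMPLE_RATE)        # sections covering one second
section_ms = 1000 * 1024 / SAMPLE_RATE       # length of one section in ms
hop_ms = 1000 * 20 / SAMPLE_RATE             # shift between sections in ms
```

At this rate a section lasts roughly 46 msec and the shift is roughly 0.9 msec, in line with the approximate figures quoted above. Because the shift is much shorter than the section, consecutive sections share almost all of their samples.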

[0024] Once the unit sections have been set, a code definition process is executed: for each unit section, codes indicating the representative frequencies contained in the utterance waveform W within that section and their intensities are created, and a code group corresponding to the phoneme is thereby defined. This code definition process is described below using a specific example based on the Fourier transform.

[0025] First, once unit sections d1, d2, d3, ... have been set as shown in FIG. 1(b), the portion of the utterance waveform W belonging to each unit section is Fourier-transformed to produce a spectrum. FIG. 2(a) shows an example of the spectrum produced for unit section d1. In this spectrum, the frequency f defined on the horizontal axis indicates the frequency components contained in the utterance waveform W within unit section d1 (0 to Fs, where Fs is the sampling frequency), and the complex intensity A defined on the vertical axis indicates the complex intensity of each frequency component. Various methods other than the Fourier transform are known for obtaining such a spectrum, and any of them may be used. If a method that produces a spectrum directly from the analog waveform is used, the utterance waveform W need not be digitized by PCM.

[0026] Next, a plurality of codes is defined discretely along the frequency axis f of this spectrum. In this example, the note number N used in MIDI data serves as the code, and 128 codes, N = 0 to 127, are defined. The note number N is a parameter indicating the pitch of a note; for example, note number N = 69 denotes the "la" (A3) at the center of the piano keyboard and corresponds to a 440 Hz tone. Since each of the 128 note numbers is thus associated with a specific frequency, the 128 note numbers N are defined discretely at specific positions on the frequency axis f of the spectrum.
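The correspondence between note numbers and frequencies can be written out explicitly. The relation below assumes standard twelve-tone equal temperament, which is the usual MIDI convention and is consistent with N = 69 at 440 Hz and a doubling of frequency per octave (12 note numbers); the patent text itself does not state the formula.

```latex
f(N) = 440 \cdot 2^{(N - 69)/12}\ \text{Hz},
\qquad
N = 69 + 12 \log_2\!\left(\frac{f}{440\ \text{Hz}}\right),
\qquad N = 0, 1, \dots, 127
```

Under this assumption the 128 note numbers span roughly 8.2 Hz (N = 0) to 12.5 kHz (N = 127), spaced evenly on a logarithmic frequency axis.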

[0027] Because the note number N represents a logarithmic scale in which the frequency doubles with each octave, it does not correspond linearly to the frequency axis f. The frequency axis f is therefore expressed on a logarithmic scale, and an intensity graph is created in which the note numbers N are defined on this logarithmic axis. FIG. 2(b) shows the intensity graph created in this way for unit section d1. The horizontal axis of the intensity graph is the horizontal axis of the spectrum in FIG. 2(a) converted to a logarithmic scale, with note numbers N = 0 to 127 plotted at equal intervals. The vertical axis is the complex intensity A of the spectrum in FIG. 2(a) converted to an effective intensity E, and indicates the intensity at the position of each note number N. In general, the complex intensity A obtained by the Fourier transform is expressed by a real part R and an imaginary part I, and the effective intensity E can be obtained by the calculation $E = (R^2 + I^2)^{1/2}$.
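An intensity graph over note numbers can be sketched with the standard library alone. Note the substitution made here: rather than computing a full spectrum and resampling it onto a logarithmic axis as the text describes, this sketch evaluates the Fourier sum directly at each note number's frequency, assuming equal temperament with N = 69 at 440 Hz. The function names, the normalization by section length, and the test tone are assumptions for illustration.

```python
import cmath
import math

def note_frequency(n):
    """Frequency in Hz of note number n (equal temperament, 69 = 440 Hz)."""
    return 440.0 * 2 ** ((n - 69) / 12)

def effective_intensity(samples, freq_hz, sample_rate):
    """E = sqrt(R^2 + I^2) of the Fourier component at freq_hz, per sample."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * k / sample_rate)
              for k, x in enumerate(samples))
    return math.sqrt(acc.real ** 2 + acc.imag ** 2) / len(samples)

def intensity_graph(samples, sample_rate):
    """Map each note number below the Nyquist frequency to its intensity."""
    nyquist = sample_rate / 2
    return {n: effective_intensity(samples, note_frequency(n), sample_rate)
            for n in range(128) if note_frequency(n) < nyquist}

# One 1024-sample unit section of a pure 440 Hz tone at an assumed 22,050 Hz.
rate = 22050
section = [math.sin(2 * math.pi * 440 * k / rate) for k in range(1024)]
graph = intensity_graph(section, rate)
strongest = max(graph, key=graph.get)
```

Note numbers whose frequency lies at or above half the sampling rate are skipped, since a sampled signal cannot represent them.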

[0028] The intensity graph obtained in this way for unit section d1 shows, as effective intensities, the proportion of each frequency component corresponding to note numbers N = 0 to 127 among the frequency components contained in the utterance waveform W belonging to unit section d1. On the basis of the effective intensities shown in this graph, P note numbers are selected from the 128. For convenience of explanation, P = 4 is assumed here, so that four note numbers are selected from the 128 candidates. The frequencies corresponding to these four note numbers must be representative frequencies, that is, values representative of the frequency components contained in the utterance waveform W belonging to unit section d1. In other words, playing these four note numbers simultaneously must reproduce a sound as close as possible to the utterance waveform within unit section d1. One method of selecting note numbers that satisfies this requirement is to select the four note numbers with the largest effective intensities E in the intensity graph of FIG. 2(b). Selecting the note numbers with large effective intensity E within unit section d1 allows the selected note numbers to represent unit section d1, at least to a first approximation.
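Selecting the P strongest note numbers is a sort followed by a slice. In the sketch below the function name and the sample intensities are invented for illustration; only the rule "take the P note numbers with the largest effective intensity" comes from the text.

```python
def select_representatives(intensity_by_note, p=4):
    """Return the p (note number, intensity) pairs with the largest intensity."""
    ranked = sorted(intensity_by_note.items(), key=lambda item: item[1], reverse=True)
    return ranked[:p]

# Made-up effective intensities for a handful of note numbers.
graph = {60: 0.9, 64: 0.4, 67: 0.7, 72: 0.1, 76: 0.5}
chosen = select_representatives(graph, p=4)
```

Each selected pair corresponds to one code of the unit section: a note number Np together with its effective intensity Ep.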

[0029] In the example shown in FIG. 2(b), four note numbers Np(d1,1), Np(d1,2), Np(d1,3), and Np(d1,4) are selected. Once the note numbers have been selected, the effective intensity of each is also obtained. If, in the example of FIG. 2(b), the effective intensities of the selected note numbers are Ep(d1,1), Ep(d1,2), Ep(d1,3), and Ep(d1,4), then the utterance waveform W of unit section d1 can be represented approximately by the following four codes.

[0030]
Code 1: Np(d1,1), Ep(d1,1)
Code 2: Np(d1,2), Ep(d1,2)
Code 3: Np(d1,3), Ep(d1,3)
Code 4: Np(d1,4), Ep(d1,4)

The processing for unit section d1 has been described above. The same processing is performed separately for unit sections d2, d3, d4, ..., so that the utterance waveform W of each unit section is represented approximately by four codes. For unit section d2, for example, the following four data pairs are obtained.

Code 5: Np(d2,1), Ep(d2,1)
Code 6: Np(d2,2), Ep(d2,2)
Code 7: Np(d2,3), Ep(d2,3)
Code 8: Np(d2,4), Ep(d2,4)

[0031] FIG. 1(c) is a conceptual diagram of the code group obtained by the encoding described above. Here each code is shown as one quarter note, placed at the position corresponding to the time axis t of the speech waveform W. That is, the four notes 1 to 4 placed at position K1 (shown by a broken line) correspond to codes 1 to 4 for unit section d1, and the four notes 5 to 8 placed at position K2 (shown by a broken line) correspond to codes 5 to 8 for unit section d2. Four tracks T1 to T4 are defined to hold the codes, and the four codes of one unit section are distributed across these four tracks. In this example all unit sections have the same length, so every note is a quarter note of the same duration, while the pitch differs according to the note number. The note representation of FIG. 1(c) does not show intensity, but the code represented by each note carries its own intensity information. For example, code 2, represented by note 2 on track T1 in FIG. 1(c), consists of note number Np(d1,2) and intensity Ep(d1,2), as described above.

[0032] In the example shown here, the four codes are sorted by frequency when they are placed on the four tracks: the code with the highest frequency (in other words, the largest note number) is placed on track T1, and the code with the lowest frequency (the smallest note number) is placed on track T4. More specifically, of the four note numbers shown in FIG. 2(b), code 1, consisting of the lowest-frequency note number Np(d1,1) and its intensity Ep(d1,1), is placed on track T4 of FIG. 1(c), and code 2, consisting of the highest-frequency note number Np(d1,2) and its intensity Ep(d1,2), is placed on track T1 of FIG. 1(c). As a result, track T1 holds the high-pitched codes and track T4 holds the low-pitched codes.
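The frequency-based placement on tracks can be sketched as follows. This is an illustration only: `assign_tracks` and the sample values are hypothetical, and each code is assumed to be a (note number, intensity) pair.

```python
def assign_tracks(codes):
    """Sort codes by note number, highest first, and map them to tracks T1, T2, ..."""
    ordered = sorted(codes, key=lambda code: code[0], reverse=True)
    return {f"T{index + 1}": code for index, code in enumerate(ordered)}

# Hypothetical codes of one unit section as (note number, intensity) pairs.
tracks = assign_tracks([(48, 0.9), (72, 0.4), (60, 0.7), (67, 0.5)])
```

With this ordering the largest note number lands on T1 and the smallest on T4, matching the arrangement described for FIG. 1(c).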

[0033] In this way, a code group such as that shown in FIG. 1(c) is defined for a specific phoneme "sa" (サ) such as that shown in FIG. 1(a). The code definition processing of the present invention is thus the processing that defines a predetermined code group, as in FIG. 1(c), for a specific phoneme. Although code definition processing has been shown here for the single phoneme "sa", the same processing is performed for all required phonemes (in Japanese, the fifty basic sounds a, i, u, e, o, ..., together with voiced sounds, contracted sounds, geminate consonants, and so on), and a unique code group is defined for each phoneme. The table that records the code groups defined for the phonemes is referred to in this specification as the "phoneme code table". In short, the phoneme code table is a record of code groups like that of FIG. 1(c), prepared for a plurality of phonemes.

[0034] To define a code group for each phoneme, a speech waveform actually uttered by a person must be prepared for each phoneme, and each waveform is preferably uttered as closely as possible at a reference duration and a reference pitch. Specifically, a person with pronunciation training, such as an announcer, is asked to utter all required phonemes (the fifty basic sounds, voiced sounds, contracted sounds, geminate consonants, and so on) at the same duration and the same pitch. These utterances are captured as the reference speech waveform of each phoneme, and the code group of each phoneme is defined from its reference speech waveform. As described later, intensity, duration, and pitch can all be corrected, so the duration and pitch of the announcer's utterances need not be very precise. When it is difficult to have a single phoneme uttered naturally in isolation, as with the consonant phonemes found in foreign languages, a suitable word can be uttered instead and the portion corresponding to the phoneme cut out of the signal waveform.

[0035] Decoding a code group prepared in the phoneme code table reproduces the speech waveform of the original phoneme approximately. For example, playing the four tracks of MIDI data shown in FIG. 1(c) simultaneously to obtain four channels of sound reproduces a sound close to the speech waveform of the original phoneme "sa". To bring the reproduced sound as close as possible to a human voice, playback is preferably performed with a MIDI sound source whose acoustic waveform is based on the human voice (a sound source generally called VOICE).

[0036] The encoding format of the present invention need not be the MIDI format, but because MIDI is the most widespread format of this kind, MIDI-format code data is the most preferable in practice. In the MIDI format, "note on" data and "note off" data appear with "delta time" data interposed between them. "Note on" data specifies a note number N and a velocity V and instructs the start of a specific sound; "note off" data specifies a note number N and a velocity V and instructs the end of a specific sound. "Delta time" data indicates a time interval. The velocity V is a parameter that represents, for example, the speed at which a piano key is pressed (note-on velocity) or the speed at which the finger leaves the key (note-off velocity), and thus indicates the strength of the operation that starts or ends a specific sound.

[0037] In this embodiment, as described above, four note numbers Np(di,1), Np(di,2), Np(di,3), and Np(di,4) are obtained for the i-th unit section di, together with their effective intensities Ep(di,1), Ep(di,2), Ep(di,3), and Ep(di,4). MIDI code data is created from these values as follows. First, the obtained note numbers Np(di,1), Np(di,2), ... are used unchanged as the note number N written in the "note on" and "note off" data. The velocity V written in the "note on" and "note off" data is obtained by normalizing the effective intensities Ep(di,1), Ep(di,2), ... to the range 0 to 1 and multiplying the square root of the normalized effective intensity E by 127. That is, with Emax denoting the maximum value of the effective intensity E, the velocity is the value V given by

V = (E / Emax)^(1/2) · 127

Alternatively, a logarithm may be taken and the velocity computed as

V = log(E / Emax) · 127 + 127 (where V = 0 if V < 0)

The "delta time" data is set according to the length of each unit section.
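The two velocity formulas can be written out as a short sketch. Several details are assumptions not fixed by the text: the logarithm is taken to base 10, the result is rounded to an integer, and a non-positive intensity is mapped to velocity 0; the function names are hypothetical.

```python
import math

def velocity_sqrt(e: float, e_max: float) -> int:
    """Velocity from the square root of the normalized intensity, scaled to 0..127."""
    return round(math.sqrt(e / e_max) * 127)

def velocity_log(e: float, e_max: float) -> int:
    """Logarithmic variant; base 10 is an assumption, negative results clamp to 0."""
    if e <= 0:
        return 0
    return max(0, round(math.log10(e / e_max) * 127 + 127))

v_peak = velocity_sqrt(1.0, 1.0)
v_mid = velocity_sqrt(0.36, 1.0)
```

Under the base-10 assumption, the logarithmic variant reaches velocity 0 once E falls to one tenth of Emax.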

[0038] For convenience of explanation, four codes are defined here per unit section and placed on four separate tracks, so that four-channel stereo playback is performed. However, typical devices capable of playing MIDI code data support 8-channel or 16-channel stereo playback. In practice it is therefore preferable to define 8 or 16 codes per unit section and place them on 8 or 16 separate tracks.

[0039] The codes recorded in the phoneme code table can be merged as necessary. In the code group of FIG. 1(c), when notes of the same pitch are placed consecutively on the same track, merging them reduces the number of notes. For example, notes 2 and 7 on track T1 have the same pitch, so the two can be merged into a single half note. Actual code data contains intensity information as well as pitch information (the note number), but merging notes of the same pitch causes no major problem even when their intensities differ. When the intensities differ, the larger one can, for example, be adopted as the intensity of the merged code. In short, when codes in a plurality of adjacent unit sections are similar to one another under a predetermined condition, they can be replaced by a single merged code spanning those unit sections.
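The merging rule can be sketched as below. This is only an illustration: notes are modeled as hypothetical (note number, velocity, length) tuples, "similar" is taken to mean "same note number", and the larger velocity is kept, which is the example rule mentioned in the text.

```python
def merge_track(notes):
    """Merge consecutive notes of equal pitch on one track.

    Each note is a (note_number, velocity, length) tuple; lengths are in unit sections.
    """
    merged = []
    for number, velocity, length in notes:
        if merged and merged[-1][0] == number:
            _, prev_velocity, prev_length = merged[-1]
            # Same pitch as the previous note: extend it and keep the larger velocity.
            merged[-1] = (number, max(prev_velocity, velocity), prev_length + length)
        else:
            merged.append((number, velocity, length))
    return merged

track_t1 = [(72, 90, 1), (72, 100, 1), (70, 80, 1)]
merged_t1 = merge_track(track_t1)
```

Here the first two quarter-length notes collapse into one note of length 2, the counterpart of the half note in the text.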

[0040] §2. Classification Based on Vowels and Consonants
Once the phoneme code table has been prepared by the procedure of §1, a desired phoneme can be reproduced as it stands. However, to reproduce an arbitrary phoneme at an arbitrary duration and an arbitrary pitch in the procedure of §3 described later, it is preferable to classify the code group defined for each phoneme into codes that contribute to the vowel and codes that contribute to the consonant, and to record them in the phoneme code table with that classification. For example, FIG. 1(c) shows the code group for the phoneme "sa", and each code shown there can be classified as contributing mainly to the vowel or mainly to the consonant. In other words, the speech waveform W of the phoneme "sa" shown in FIG. 1(a) contains both a vowel component and a consonant component, so the codes in FIG. 1(c) are a mixture of codes derived from the vowel component and codes derived from the consonant component. The benefit of this classification is explained in §3. When each defined code is recorded in the phoneme code table, it is preferable to attach to each code information indicating the class to which it belongs.

[0041] A typical Japanese phoneme takes the form "consonant" + "vowel", so the consonant is uttered before the vowel, and it might seem that the "consonant portion" and the "vowel portion" could be separated on the time axis. Experiments by the inventor, however, showed that separating the two on the time axis is very difficult. For example, the phoneme "sa" is a combination of the consonant "S" and the vowel "A". Yet a time-based classification of the code group of FIG. 1(c), in which the codes at positions K1 to K3 (the earlier unit sections) are taken to contribute to the consonant "S" and the codes at positions K4 to K9 (the later unit sections) are taken to contribute to the vowel "A", turned out not to be a correct classification in practice.

[0042] According to the inventor's experiments, classification by frequency gives more accurate results than classification by time. The simplest classification method for practicing the invention is this frequency-based method. In general, vowels have low frequency components and consonants have high frequency components. A rough classification is therefore possible in which low-frequency codes (small note numbers) are treated as codes that contribute to the vowel and high-frequency codes (large note numbers) as codes that contribute to the consonant. Because the code group of FIG. 1(c) is sorted by frequency as described above, track T1 holds the high-frequency codes and track T4 holds the low-frequency codes. With the frequency-based method, a simple classification can therefore be made, as shown in FIG. 4 for example, in which all codes on track T4 contribute to the vowel and all codes on tracks T1 to T3 contribute to the consonant.
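The track-based rule can be sketched in a few lines. The function name and the `vowel_tracks` parameter are hypothetical; the track list is assumed to be ordered from highest frequency (T1) to lowest.

```python
def classify_by_track(track_names, vowel_tracks=1):
    """Label the lowest-frequency tracks 'vowel' and all other tracks 'consonant'."""
    split = len(track_names) - vowel_tracks
    return {
        name: "consonant" if index < split else "vowel"
        for index, name in enumerate(track_names)
    }

labels_4 = classify_by_track(["T1", "T2", "T3", "T4"])
```

The same function covers other track counts by passing a longer list.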

[0043] According to the inventor's experiments, when a code group of eight tracks is defined for a phoneme consisting of a vowel and a consonant, treating only the single lowest-frequency track as codes that contribute to the vowel, and the remaining seven tracks as codes that contribute to the consonant, gives a relatively accurate classification.

[0044] For a code group obtained for a phoneme consisting only of a vowel, such as "a" (ア), "i" (イ), or "u" (ウ), all codes on all tracks are treated as codes that contribute to the vowel. Conversely, for a code group obtained for a phoneme consisting only of a consonant (none exist in Japanese), all codes on all tracks are treated as codes that contribute to the consonant.

[0045] The frequency-based classification method described above is useful in that the classification work is simple, but applying it uniformly does not always give a correct result. Consider, for example, the case shown in FIG. 3, in which codes are placed on the four tracks T1 to T4 (the pitch of each note is omitted for convenience of illustration) and, in reality, the circled codes contribute to the vowel while the other codes contribute to the consonant. Even in this case, the method described above would uniformly treat all codes on track T4 as contributing to the vowel and all codes on tracks T1 to T3 as contributing to the consonant.

[0046] While repeating experiments in which code groups were defined for various speech waveforms, the inventor found a rule that permits a more accurate classification. When the same phoneme is uttered at different durations or at different pitches to produce a plurality of speech waveforms, and a code group is defined for each waveform by the method of §1, the codes that contribute to the consonant change little, whereas the codes that contribute to the vowel change greatly. In other words, changes in duration and pitch appear mainly in the vowel. Codes that change little when the duration or pitch is varied can therefore be regarded as codes that contribute to the consonant, and codes that change greatly when the duration or pitch is varied can be regarded as codes that contribute to the vowel.

[0047] This property permits a more accurate classification. For example, an announcer is asked to utter the phoneme "sa" at various durations and various pitches, and the resulting speech waveforms are captured. A wide variety of durations and pitches is desirable, but high-precision trials, such as varying the pitch in equal steps, are not required at all. A code group is then created for each of the captured waveforms. Among the resulting code groups, codes whose variation stays within a predetermined tolerance are treated as codes that contribute to the consonant, and the remaining codes as codes that contribute to the vowel. With this method a fairly accurate classification is possible even when the vowel and consonant codes are distributed as in FIG. 3. In the example of FIG. 3, changing the duration or pitch of the utterance produces no large change in the pitch or intensity of the uncircled notes, whereas the circled notes show a large change in pitch or intensity.
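One possible reading of the variation-based rule is sketched below. The specification does not say how codes from different utterances are matched or how variation is measured, so everything here is an assumption: codes are compared slot by slot (same track and position), variation is the range of note numbers and of velocities, and the tolerance values and names are hypothetical.

```python
def classify_by_variation(observations, note_tolerance=1, velocity_tolerance=10):
    """Label each slot 'consonant' if its codes barely change across utterances.

    `observations` maps a slot, e.g. ("T1", "K1"), to the (note number, velocity)
    pairs observed for that slot in each captured utterance.
    """
    labels = {}
    for slot, codes in observations.items():
        notes = [note for note, _ in codes]
        velocities = [velocity for _, velocity in codes]
        stable = (
            max(notes) - min(notes) <= note_tolerance
            and max(velocities) - min(velocities) <= velocity_tolerance
        )
        labels[slot] = "consonant" if stable else "vowel"
    return labels

# Hypothetical observations from three utterances of the same phoneme.
observations = {
    ("T1", "K1"): [(84, 60), (84, 62), (85, 58)],
    ("T4", "K1"): [(48, 90), (55, 100), (60, 70)],
}
labels = classify_by_variation(observations)
```

Unlike the track-based rule, this labels each code individually, so vowel codes on upper tracks (as in FIG. 3) can still be identified.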

[0048] With a classification method such as those described above, the code group prepared for each phoneme in the phoneme code table, obtained by the first-stage procedure of the invention, can be classified into codes that contribute to the vowel and codes that contribute to the consonant.

[0049] §3. Second-Stage Procedure of the Invention
As already described, the first-stage procedure of the invention prepares the phoneme code table. The second-stage procedure uses this phoneme code table to synthesize a human voice corresponding to a specific phoneme.

[0050] To synthesize a human voice corresponding to a specific phoneme, the basic approach is to prepare synthesis instruction data that specifies the phoneme to be synthesized, extract from the phoneme code table the code group defined for that phoneme, and play the extracted code group with a predetermined sound source. For example, to synthesize the phrase "sakura ga saita" (サクラが咲いた, "the cherry blossoms have bloomed"), synthesis instruction data specifying the seven phonemes "sa-ku-ra-ga-sa-i-ta" (サクラガサイタ) is prepared. Based on this data, the code group defined for the phoneme "sa" is first extracted from the phoneme code table and played, then the code group defined for the phoneme "ku" is extracted and played, and so on, one phoneme after another.

[0051] With synthesis instruction data that specifies only the phonemes, however, "sakura ga saita" is merely reproduced in a flat monotone, and the intonation characteristic of human speech cannot be reproduced. As described above, the phoneme code table is prepared by having a person with pronunciation training, such as an announcer, utter all required phonemes (the fifty basic sounds, voiced sounds, contracted sounds, geminate consonants, and so on) at the same duration and the same pitch, and the codes of each phoneme are defined from these reference speech waveforms. Playing the code groups extracted from the phoneme code table unchanged therefore yields only unnatural synthesized speech with no intonation at all.

[0052] In this embodiment, the synthesis instruction data specifies the phoneme, duration, pitch, and intensity, so that any phoneme can be reproduced at any duration, pitch, and intensity. This permits speech synthesis with a high degree of freedom. Specifically, the synthesis instruction data given to synthesize one phoneme uses a format consisting of four codes, as shown in FIG. 5. The phoneme code specifies the phoneme to be synthesized; a character code such as ASCII or JIS can be used for it. The duration code specifies the length for which the phoneme is played; for example, a value in seconds of real time can be used directly. The pitch code specifies the pitch at which the phoneme is played; for example, the ordinary note names used in music can be used directly. The intensity code specifies the loudness at which the phoneme is played; for example, a number indicating the relative amplitude of the reproduced sound can be used.
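The four-code format could be modeled as a small record type. This is a sketch only: the class name, field names, and field types are hypothetical, and the sample values are illustrative rather than taken from FIG. 5 or FIG. 6.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SynthesisInstruction:
    """One entry of synthesis instruction data (hypothetical field names)."""
    phoneme: str         # phoneme code, here simply the character itself
    duration_sec: float  # duration code, in seconds of real time
    pitch: str           # pitch code, as a note name such as "C3"
    intensity: int       # intensity code, a relative amplitude value

# Illustrative instruction for one phoneme.
instruction = SynthesisInstruction(phoneme="サ", duration_sec=1.0, pitch="C3", intensity=8)
```

A phrase would then be a list of such records, one per phoneme, processed in order.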

[0053] FIG. 6 shows an example of concrete synthesis instruction data written in the format of FIG. 5. In this example a singing voice consisting of the six phonemes "sa-ku-ra-sa-ku-ra" (サクラサクラ) is synthesized. As shown in the figure, the duration codes, pitch codes, and intensity codes of the six phonemes are set to various values, so that a singing voice with melody and intonation is synthesized.

[0054] Given concrete synthesis instruction data such as that shown in FIG. 6, the synthesized singing voice can be played as follows. First, based on the codes written in the first line of FIG. 6, the phoneme "sa" is played at a duration of 1 sec, a pitch of C3, and an intensity of 8, by the following procedure. The phoneme code table is consulted using the character code for "sa", and the code group defined for the phoneme "sa" is extracted. This code group is held on a plurality of tracks, as shown in FIG. 3 for example, and each code is classified as a code that contributes to the vowel or a code that contributes to the consonant. The duration, pitch, and intensity of the extracted code group are then corrected. In this example, the correction is made so that the played duration is 1 sec, the pitch is C3, and the intensity is 8.

[0055] For MIDI codes, duration correction means adjusting the delta time. For example, if the code group originally recorded in the phoneme code table was defined from a reference speech waveform that the announcer uttered with a duration of 0.5 sec, each code is stretched by a factor of two on the time axis. This duration correction, however, is applied only to the codes that contribute to the vowel. As already noted, changes in duration appear mainly in the vowel, so applying duration correction to the codes that contribute to the consonant as well would synthesize an unnatural speech waveform. In the example of FIG. 3, therefore, only the circled notes have their durations corrected.
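The duration correction can be sketched as follows, under stated assumptions: each code is a dictionary with a hypothetical `ticks` field holding its length in delta-time units and a `role` field holding its vowel/consonant class. How the stretched vowel codes are realigned with the unstretched consonant codes is not specified in the text and is not handled here.

```python
def correct_duration(codes, reference_sec, target_sec):
    """Stretch the length of vowel codes by target/reference; leave consonant codes as is."""
    factor = target_sec / reference_sec
    return [
        {**code, "ticks": round(code["ticks"] * factor)} if code["role"] == "vowel" else dict(code)
        for code in codes
    ]

# Hypothetical codes: one vowel code and one consonant code of equal length.
codes = [
    {"note": 48, "velocity": 100, "ticks": 120, "role": "vowel"},
    {"note": 84, "velocity": 60, "ticks": 120, "role": "consonant"},
]
stretched = correct_duration(codes, reference_sec=0.5, target_sec=1.0)
```

With a 0.5 sec reference and a 1 sec target, the vowel code doubles in length while the consonant code keeps its original length.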

[0056] Pitch correction, for MIDI codes, means changing the note number. For example, if the code group originally recorded in the phoneme code table was defined from a reference speech waveform that the announcer uttered at a pitch of C2, the pitch code specifies C3, one octave higher, so every code is raised by one octave. This pitch correction, too, is applied only to the codes that contribute to the vowel. As already noted, changes in pitch appear mainly in the vowel, so applying pitch correction to the codes that contribute to the consonant as well would synthesize an unnatural speech waveform. In the example of FIG. 3, therefore, only the circled notes are shifted up by one octave.
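A sketch of the pitch correction follows. The code layout and function names are hypothetical, only natural note names are handled, and the octave numbering (C4 = note number 60) is an assumption; the computed shift does not depend on that choice.

```python
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_name_to_number(name: str) -> int:
    """Convert a natural note name such as 'C3' to a MIDI note number (assumes C4 = 60)."""
    return NOTE_OFFSETS[name[0]] + 12 * (int(name[1:]) + 1)

def correct_pitch(codes, reference: str, target: str):
    """Shift the note number of vowel codes by the interval from reference to target."""
    shift = note_name_to_number(target) - note_name_to_number(reference)
    return [
        {**code, "note": min(127, max(0, code["note"] + shift))}
        if code["role"] == "vowel" else dict(code)
        for code in codes
    ]

codes = [
    {"note": 48, "velocity": 100, "role": "vowel"},
    {"note": 84, "velocity": 60, "role": "consonant"},
]
shifted = correct_pitch(codes, reference="C2", target="C3")
```

Moving from C2 to C3 adds 12 to the note number of each vowel code, that is, one octave.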

[0057] Intensity correction, in the case of MIDI codes, amounts to changing the velocity. If, for example, 5 is defined in advance as the standard value of the intensity code, then for a specified intensity of 8 the velocity of each code is multiplied by 8/5. This intensity correction is applied uniformly to all codes, with no distinction between codes that contribute to vowels and codes that contribute to consonants.
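A minimal sketch of this intensity correction is shown below. The function name is hypothetical, and the clamping to the range 1 to 127 is an assumption added here because MIDI velocities are 7-bit values and a note-on with velocity 0 is interpreted as a note-off; the original text does not say how out-of-range results should be handled.

```python
def scale_velocity(velocities, intensity, standard=5):
    """Scale every velocity by intensity / standard (vowels and consonants alike)."""
    factor = intensity / standard
    # Clamping is an assumption of this sketch, not part of the original text.
    return [max(1, min(127, round(v * factor))) for v in velocities]

scaled = scale_velocity([40, 64, 100], intensity=8)
```

Here the factor is 8/5, so 40 becomes 64 and 64 becomes 102, while 100 would become 160 and is clamped to 127.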

[0058] Playing back the codes corrected in this way on a predetermined sound source reproduces the synthesized sound of the phoneme "sa" (サ) in accordance with the synthesis instruction data in the first line of FIG. 6. The synthesized sounds for the synthesis instruction data in the second through sixth lines are reproduced in exactly the same way.

[0059] Thus, with the speech synthesis method according to the present invention, the phoneme, tone length, pitch, and intensity can be chosen freely by changing the content of the synthesis instruction data, so that any singing voice can be synthesized. In English, a sentence requires pitch changes similar to those of singing, for instance because a difference in intonation turns a statement into a question, whereas individual words need only intensity control by stress accent. In Japanese, on the other hand, sentences do not usually show intonation, but words are often distinguished by pitch rather than by stress, as in "hashi" (箸, chopsticks) versus "hashi" (橋, bridge) or "ame" (雨, rain) versus "ame" (飴, candy). Even in such cases, the speech synthesis method according to the present invention allows words to be synthesized correctly with pitch taken into account.

[0060] §4. Speech Synthesis Apparatus According to the Present Invention
Finally, the configuration of an apparatus for carrying out the speech synthesis method described above is briefly described with reference to the block diagram of FIG. 7. The main components of the speech synthesis apparatus shown in FIG. 7 are a data input unit 10, a phoneme code table 20, a code correction unit 30, and a voice reproduction unit 40.

[0061] The data input unit 10 has the function of inputting synthesis instruction data D, which specifies the phoneme to be synthesized together with its tone length, pitch, and intensity. The synthesis instruction data D is given, for example, in the format shown in FIG. 5. By preparing data in this format for as many phonemes as required, words and lyrics can be expressed as a sequence of phonemes.
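One way to picture a record of the synthesis instruction data D is sketched below. The text states only that each record carries a phoneme code, a tone length code, a pitch code, and an intensity code; the whitespace-separated text encoding, the field order, and all names here are assumptions for illustration, since the actual layout of FIG. 5 is not reproduced in this text. The example values are those the description gives for the first line of FIG. 6.

```python
from typing import NamedTuple

class Instruction(NamedTuple):
    phoneme: str       # phoneme code, e.g. "サ"
    length_sec: float  # tone length code, in seconds
    pitch: str         # pitch code, e.g. "C3"
    intensity: int     # intensity code (5 is the standard value in the text)

def parse_line(line):
    """Parse one hypothetical whitespace-separated instruction record."""
    phoneme, length, pitch, intensity = line.split()
    return Instruction(phoneme, float(length), pitch, int(intensity))

# First line of FIG. 6 as described in the text: "sa", 1 sec, C3, intensity 8.
first = parse_line("サ 1.0 C3 8")
```

A word or a lyric is then simply a list of such records, one per phoneme.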

[0062] The phoneme code table 20, for its part, records for each required phoneme a code group for reproducing the pronunciation waveform of a human uttering that phoneme at a reference tone length and a reference pitch, with the codes classified into codes that contribute to vowels and codes that contribute to consonants. Specifically, for each phoneme of the fifty basic sounds (a, i, u, e, o, ...) and of the voiced sounds (dakuon), contracted sounds (yoon), and geminate sounds (sokuon), an announcer or similar speaker utters the phoneme at the reference tone length (for example, 0.5 sec) and the reference pitch (for example, C2). A code group is created for each phoneme from the resulting pronunciation waveform by the procedure described in §1, and the code groups so created are recorded as the phoneme code table 20. At the same time, the method described in §2 is used to record, for each code, whether it contributes to a vowel or to a consonant.

[0063] The code correction unit 30 has the function of reading from the phoneme code table 20, based on the synthesis instruction data D input by the data input unit 10, the code group for the specific phoneme to be synthesized, and of correcting the tone length and pitch of the codes in that group that contribute to vowels in accordance with the synthesis instruction data D. For example, when specific synthesis instruction data such as that shown in FIG. 6 is given, then, as described in §3, the phoneme code table 20 is first consulted using the phoneme code "sa" (サ) in the first line, and the code group defined for the phoneme "sa" is extracted. Next, based on the tone length code and the pitch code, the codes in the extracted group that contribute to vowels are corrected so that the tone length on reproduction is 1 sec and the pitch is C3. Further, based on the intensity code, all codes in the extracted group are corrected so that the intensity on reproduction is 8. In the case of MIDI codes, the tone length is corrected by modifying the delta time, the pitch by modifying the note number, and the intensity by modifying the velocity.
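The processing of the code correction unit 30 can be summarized in a single pass over the extracted code group, as in the following sketch. The `Code` record, the table contents, and the `octave_shift` parameter are hypothetical and chosen only for this example. The reference tone length of 0.5 sec and the standard intensity of 5 are the example values given in the description, and the velocity clamp to 1..127 is an added assumption.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Code:
    note: int      # MIDI note number
    velocity: int  # MIDI velocity
    delta: int     # delta time in ticks
    vowel: bool    # True if the code contributes to a vowel

BASE_LENGTH_SEC = 0.5   # reference tone length (example value from the text)
STANDARD_INTENSITY = 5  # standard intensity value (example value from the text)

# Hypothetical table entry; real entries come from the encoding procedure.
TABLE = {"サ": [Code(84, 50, 24, False), Code(48, 70, 96, True)]}

def correct(table, phoneme, length_sec, octave_shift, intensity):
    """Look up the code group and apply the three corrections."""
    ratio = length_sec / BASE_LENGTH_SEC
    gain = intensity / STANDARD_INTENSITY
    corrected = []
    for c in table[phoneme]:
        # Intensity: applied to every code.
        velocity = max(1, min(127, round(c.velocity * gain)))
        # Tone length and pitch: applied to vowel codes only.
        if c.vowel:
            c = replace(c, note=c.note + 12 * octave_shift,
                        delta=round(c.delta * ratio))
        corrected.append(replace(c, velocity=velocity))
    return corrected

result = correct(TABLE, "サ", length_sec=1.0, octave_shift=1, intensity=8)
```

The consonant code keeps its note number and delta time and only its velocity changes, whereas the vowel code is raised one octave, doubled in length, and scaled in velocity.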

[0064] The voice reproduction unit 40 has the function of reproducing the corrected code group using a predetermined sound source and uttering a synthesized human voice. In the illustrated example it consists of a control unit 41, a sound source unit 42, an amplifier unit 43, and a speaker 44. The control unit 41 performs the basic reproduction processing based on the corrected code group supplied by the code correction unit 30. The sound source unit 42 is a MIDI sound source of the type generally called "VOICE"; it holds acoustic waveforms created from human voice and provides the acoustic waveform corresponding to given MIDI data. Based on the supplied code group (MIDI data in this example), the control unit 41 successively reads the necessary acoustic waveforms from the sound source unit 42, synthesizes the reproduced sound, and outputs the synthesized voice as an analog signal. This analog signal is amplified by the amplifier unit 43 and reproduced as sound by the speaker 44.
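For readers unfamiliar with how MIDI data of this kind is laid out, the sketch below encodes one corrected code as a note-on and note-off pair in Standard MIDI File track-event form, where each event is preceded by a delta time stored as a variable-length quantity. The helper names, the choice of channel 0, and the example values are assumptions; the original text does not specify how the control unit 41 serializes its data.

```python
def vlq(value):
    """Encode a non-negative integer as a MIDI variable-length quantity."""
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)  # continuation bit on all but the last byte
        value >>= 7
    return bytes(reversed(out))

def note_events(note, velocity, ticks, channel=0):
    """Note-on at delta 0, then note-off after `ticks` ticks."""
    on = vlq(0) + bytes([0x90 | channel, note, velocity])
    off = vlq(ticks) + bytes([0x80 | channel, note, 0])
    return on + off

# Hypothetical corrected vowel code: note 60, velocity 112, 192 ticks long.
events = note_events(note=60, velocity=112, ticks=192)
```

In this layout, lengthening a vowel only changes the delta time before the note-off, raising its pitch only changes the note byte, and changing its intensity only changes the velocity byte.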

[0065] Of the components shown in FIG. 7, the code correction unit 30 can be realized by installing dedicated software on a general-purpose computer, and the phoneme code table 20 can be realized by a storage device (memory or any of various storage media) accessible from that computer. The data input unit 10 can be realized by any of various input devices for supplying the synthesis instruction data D to the computer, and the voice reproduction unit 40 can be realized by a MIDI device for computers.

[0066] With this speech synthesis apparatus, speech can be synthesized with freely chosen tone lengths and pitches rather than as a flat, monotone reading. It becomes possible to synthesize read-aloud speech with intonation close to human pronunciation, and also to synthesize a singing voice that follows a melody. Moreover, because the synthesis instruction data D need not contain the actual pronunciation waveform, the amount of information in the synthesis instruction data D itself can be kept extremely small. Like ordinary MIDI data, the synthesis instruction data D is therefore well suited to distribution over communication lines, and the speech synthesis apparatus according to the present invention can be expected to find use in fields such as online karaoke. Alternatively, since the synthesis instruction data D can also be expressed as a barcode or the like, providing the data input unit 10 with a barcode reader makes it possible to print the synthesis instruction data D as a barcode on printed matter carrying a score or lyrics and to distribute it in that form, and such printed matter can also be transmitted by facsimile or similar means.

[0067]

[Effects of the Invention] As described above, the speech synthesis method and speech synthesis apparatus according to the present invention make it possible to synthesize human voice efficiently and with freedom in tone length and pitch.

[Brief description of the drawings]

FIG. 1 is a diagram showing the basic concept of the phoneme encoding process that forms the first stage of the speech synthesis method according to the present invention.

FIG. 2 is a diagram showing how the note numbers indicating the representative frequencies of each unit section are selected in the encoding process of FIG. 1.

FIG. 3 is a diagram showing the code data stored in the four tracks in the encoding process of FIG. 1, classified into code data that contributes to vowels and code data that contributes to consonants.

FIG. 4 is a diagram showing code data that contributes to vowels and code data that contributes to consonants classified track by track.

FIG. 5 is a diagram showing an example of the format of the synthesis instruction data that specifies the phoneme to be synthesized in the speech synthesis method according to the present invention.

FIG. 6 is a diagram showing specific synthesis instruction data written in the format of FIG. 5.

FIG. 7 is a block diagram showing the basic configuration of a speech synthesis apparatus according to an embodiment of the present invention.

[Explanation of symbols]

1 to 8: code data (notes)
10: data input unit
20: phoneme code table
30: code correction unit
40: voice reproduction unit
41: control unit
42: sound source unit
43: amplifier unit
44: speaker
A: complex intensity obtained by Fourier transform
D: synthesis instruction data
d1 to d9: unit sections
E: effective intensity
f: frequency
Fs: sampling frequency
K1 to K9: positions on the time axis corresponding to unit sections d1 to d9
N: note number
Np(d1,1) to Np(d1,4): note numbers selected for unit section d1
T1 to T4: tracks
W: pronunciation waveform

Claims

[Claims]

1. A method for synthesizing a human voice corresponding to a specific phoneme, wherein: a human pronunciation waveform for one phoneme is captured as data; a plurality of unit sections are set on the time axis of the pronunciation waveform; for each unit section, a code indicating a representative frequency contained in the pronunciation waveform within that unit section and its intensity is created, thereby defining a code group corresponding to the phoneme; this code definition process is performed for each required phoneme to prepare a phoneme code table in which the code group defined for each phoneme is recorded; and, when a synthesis instruction to synthesize a human voice corresponding to a specific phoneme is given, the code group defined for the specific phoneme is extracted by referring to the phoneme code table and the extracted code group is reproduced using a predetermined sound source, thereby synthesizing the human voice.

2. The speech synthesis method according to claim 1, wherein, when the plurality of unit sections are set on the time axis of the pronunciation waveform, they are set such that adjacent unit sections partially overlap on the time axis.

3. The speech synthesis method according to claim 1, wherein a plurality P of codes are created by defining a plurality P of representative frequencies for one unit section, these codes are stored separately in P tracks, and, on reproduction, the codes stored in the P tracks are reproduced simultaneously to obtain a P-channel reproduced sound.

4. The speech synthesis method according to claim 1, wherein: the code group defined for each phoneme is recorded in the phoneme code table classified into codes that contribute to vowels and codes that contribute to consonants; the synthesis instruction for a specific phoneme includes a parameter indicating the tone length with which the specific phoneme is to be uttered; and, when reproduction is performed based on the synthesis instruction, the tone length of the codes that contribute to vowels, among the codes extracted from the phoneme code table, is corrected based on the parameter indicating the tone length.

5. The speech synthesis method according to claim 1, wherein: the code group defined for each phoneme is recorded in the phoneme code table classified into codes that contribute to vowels and codes that contribute to consonants; the synthesis instruction for a specific phoneme includes a parameter indicating the pitch at which the specific phoneme is to be uttered; and, when reproduction is performed based on the synthesis instruction, the pitch of the codes that contribute to vowels, among the codes extracted from the phoneme code table, is corrected based on the parameter indicating the pitch.

6. The speech synthesis method according to claim 4, wherein: a plurality P of codes are created by defining a plurality P of representative frequencies for one unit section, and these codes are sorted by frequency and stored separately in P tracks; for a code group obtained for a phoneme consisting only of a vowel, all codes stored in the P tracks are treated as codes that contribute to vowels; for a code group obtained for a phoneme consisting only of a consonant, all codes stored in the P tracks are treated as codes that contribute to consonants; and, for a code group obtained for a phoneme containing a vowel and a consonant, the codes stored in the n tracks on the low-frequency side are treated as codes that contribute to vowels and the codes stored in the remaining (P-n) tracks are treated as codes that contribute to consonants.

7. The speech synthesis method according to claim 4 or 5, wherein: for a code group obtained for a phoneme consisting only of a vowel, all codes are treated as codes that contribute to vowels; for a code group obtained for a phoneme consisting only of a consonant, all codes are treated as codes that contribute to consonants; and, for a code group obtained for a phoneme containing a vowel and a consonant, a plurality of pronunciation waveforms of the phoneme uttered with the pitch varied are captured, a code group is created for each of these pronunciation waveforms, and, among the code groups so created, codes that show only a change within a predetermined tolerance are treated as codes that contribute to consonants and the remaining codes are treated as codes that contribute to vowels.

8. The speech synthesis method according to claim 4 or 5, wherein: for a code group obtained for a phoneme consisting only of a vowel, all codes are treated as codes that contribute to vowels; for a code group obtained for a phoneme consisting only of a consonant, all codes are treated as codes that contribute to consonants; and, for a code group obtained for a phoneme containing a vowel and a consonant, a plurality of pronunciation waveforms of the phoneme uttered with the pitch varied are captured, a code group is created for each of these pronunciation waveforms, and, among the code groups so created, codes that show only a change within a predetermined tolerance are treated as codes that contribute to consonants and the remaining codes are treated as codes that contribute to vowels.

9. The speech synthesis method according to claim 1, wherein MIDI-format code data is created by representing a representative frequency contained in the pronunciation waveform within a unit section by a note number, the intensity of the representative frequency by a velocity, and the time corresponding to the length of the unit section by a delta time, and the created MIDI-format code data is reproduced using a MIDI sound source having acoustic waveforms based on human voice.

10. The speech synthesis method according to claim 1, wherein, when codes that are similar to one another under a predetermined condition exist across a plurality of adjacent unit sections, these similar codes are replaced with a single integrated code spanning the plurality of unit sections.

11. An apparatus for synthesizing a human voice corresponding to a specific phoneme, comprising: a data input unit that inputs synthesis instruction data specifying a specific phoneme to be synthesized and its tone length; a phoneme code table in which, for each required phoneme, a code group for reproducing the pronunciation waveform of a human uttering the phoneme at a reference tone length is recorded, classified into codes that contribute to vowels and codes that contribute to consonants; a code correction unit that, based on the synthesis instruction data, reads from the phoneme code table the code group for the specific phoneme to be synthesized and corrects the tone length of the codes in the read code group that contribute to vowels based on the synthesis instruction data; and a voice reproduction unit that reproduces the corrected code group using a predetermined sound source to utter a synthesized human voice.

12. An apparatus for synthesizing a human voice corresponding to a specific phoneme, comprising: a data input unit that inputs synthesis instruction data specifying a specific phoneme to be synthesized and its pitch; a phoneme code table in which, for each required phoneme, a code group for reproducing the pronunciation waveform of a human uttering the phoneme at a reference pitch is recorded, classified into codes that contribute to vowels and codes that contribute to consonants; a code correction unit that, based on the synthesis instruction data, reads from the phoneme code table the code group for the specific phoneme to be synthesized and corrects the pitch of the codes in the read code group that contribute to vowels based on the synthesis instruction data; and a voice reproduction unit that reproduces the corrected code group using a predetermined sound source to utter a synthesized human voice.