
CN1310209C - Speech and music regeneration device - Google Patents


Info

Publication number
CN1310209C
CN1310209C · CNB2004100474146A · CN200410047414A
Authority
CN
China
Prior art keywords: data, speech, user, voice, music
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2004100474146A
Other languages
Chinese (zh)
Other versions
CN1573921A (en)
Inventor
川岛隆宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from JP2003152895A (granted as JP4244706B2)
Priority claimed from JP2003340171A (published as JP2005107136A)
Application filed by Yamaha Corp
Publication of CN1573921A
Application granted
Publication of CN1310209C
Current legal status: Expired - Fee Related


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/15 Speech or voice analysis techniques in which the extracted parameters are formant information
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/471 General musical sound synthesis principles, i.e. sound category-independent synthesis methods
    • G10H2250/481 Formant synthesis, i.e. simulating the human speech production mechanism by exciting formant resonators, e.g. mimicking vocal tract filtering as in LPC synthesis vocoders, wherein musical instruments may be used as excitation signal to the time-varying filter estimated from a singer's speech


Abstract

The speech and music reproduction device according to the present invention comprises middleware that reproduces synthesized speech signals and the like, a sound source (20) that reproduces the desired speech or music from those signals, and a loudspeaker. The middleware combines the script data (11), user timbre parameters (12), and user phrase synthesis dictionary data (13), and reproduces synthesized speech signals and the like based on the default timbre parameters (18) and default synthesis dictionary data (19). When an HV-script describing various events is used as the script data, the desired waveform data, music phrase data containing note information, and formant frame data are combined and reproduced as appropriate according to the type of each event.

Figure 200410047414

Description

Speech and music reproduction device

Technical field

The present invention relates to a speech and music reproduction device, and more particularly to a speech and music reproduction device that reproduces specific utterances by speech synthesis while converting character information into speech or music for reproduction.

Background art

Conventionally, character-string-to-speech conversion devices have been designed that convert character string information, such as e-mail, into speech output. Japanese Laid-Open Patent Publication No. 2001-7937 shows an example of such a device, in which the character string information is divided into clause units and its content is shown on a display while the speech is output.

Known methods include reproducing waveform data (or sampled data) created by sampling music phrases or speech phrases, and assembling a music phrase from note information such as SMF (Standard MIDI File) or SMAF (Synthetic Music Mobile Application Format) and then reproducing that phrase. For example, Japanese Laid-Open Patent Publication No. 2001-51688 discloses an e-mail reading device that can separate the character string information and the musical tone information in an e-mail and reproduce each of them audibly.

However, since the existing character-string-to-speech conversion devices divide the character string information into clause units (clauses or phrases) for speech output, the output is a collection of speech in pronunciation units (or character units), and when the connection points between pronunciation units are reproduced, the result sounds unnatural to the listener compared with normal spoken speech. That is, in the conventional devices the timbre cannot be varied and output with good sound quality over a whole clause; in other words, natural speech close to human spoken language cannot be output.

A method considered for solving the above problem is, for example, to sample the speech of each clause (hereinafter called a "phrase") in advance, store it as speech data, and output it as the corresponding speech waveform at reproduction time. However, to raise the quality of the speech output this method must raise the sampling frequency, which requires storing a large volume of speech data; in devices with relatively limited storage capacity, such as portable telephones (handy phones or mobile phones), this entails technical difficulties.

In addition, in the conventional method of reproducing waveform data created by sampling music or speech, and in the conventional method of building one piece of music data from note information such as SMF or SMAF and then reproducing it, the reproduction timing of the music or speech is not described in a text file, so it is difficult to combine speech reproduction based on character string information with waveform data reproduction or music data reproduction as the user intends.

Summary of the invention

To solve the above problems, an object of the present invention is to provide a speech reproduction device that can render a desired clause (or phrase) composed of character string information and the like as speech of good sound quality, vary its timbre, and reproduce and output it.

Another object of the present invention is to provide a speech and music reproduction device with which the user can easily combine speech reproduction or waveform data reproduction with music data reproduction, so that speech and music can be reproduced faithfully to the user's intention.

The speech reproduction device according to the present invention stores, as synthesis dictionary data, a database of formant frame data corresponding to predetermined pronunciation units; when information on a character string formed by concatenating pronunciation units is supplied, speech is synthesized using the synthesis dictionary data. Here, when formant frame data are replaced with arbitrary user phrase data and character string information is supplied, speech is synthesized using the synthesis dictionary data in which the user phrase data have been substituted. Timbre parameters for processing the formant frame data are attached to the user phrase data. In addition, a prescribed data interchange format containing the user phrase data is used for speech synthesis. This data interchange format is, for example, the SMAF file format, and may contain not only user phrase data but also various chunks and music reproduction information.

Specifically, the above speech reproduction device comprises default synthesis dictionary data holding formant frame data corresponding to predetermined pronunciation units, a middleware application program interface (API) that replaces the formant frame data with user phrase data, a converter, a driver, and a sound source. In this way, a desired phrase composed of character string information can be reproduced as speech of good sound quality, and its timbre can be varied appropriately during reproduction.

The speech and music reproduction device according to the present invention stores script data (an HV-script) describing the pronunciation of characters and reproduction instructions for prestored pronunciation data. Based on the script data, a speech signal corresponding to the characters is generated to produce the desired speech, and a sound signal corresponding to the pronunciation data is generated to produce the desired speech or musical tone. Here, the pronunciation data consist, for example, of waveform data generated by sampling speech or music, and a synthesized sound signal is generated from the waveform data. When the pronunciation data are music data containing note information, a musical tone signal corresponding to the note information is generated from the music data. When formant control parameters (formant frame data) characterizing the pronunciation of the characters are stored, a speech signal is generated from the formant control parameters. The script data may also be created arbitrarily by the user; in that case, the script data take a prescribed file form created by text input.

Specifically, the various events described in the HV-script are interpreted; when the type of an event indicates waveform data, the waveform data are read out and reproduced, and when the type of an event indicates music phrase data, reproduction processing of the music phrase data is performed. In the latter case, the note data are read out and reproduced according to the timing information in the music phrase data. For other events, an input character string is converted into a formant frame string using the synthesis dictionary data, and speech is synthesized. In this way, the user can easily combine speech reproduction, waveform data reproduction, and music data reproduction.
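The event dispatch described above can be sketched as follows. This is a hypothetical illustration only: the event type names, the handler structure, and the toy data banks are assumptions, not taken from the patent.

```python
def reproduce_script(events, waveform_bank, phrase_bank, synth_dictionary):
    """Dispatch each HV-script event to the matching reproduction routine."""
    output = []
    for event in events:
        if event["type"] == "waveform":
            # Read out the referenced waveform data and reproduce it directly.
            output.append(("waveform", waveform_bank[event["id"]]))
        elif event["type"] == "music_phrase":
            # Reproduce note data according to the timing info in the phrase.
            for time, note in sorted(phrase_bank[event["id"]]):
                output.append(("note", time, note))
        elif event["type"] == "text":
            # Convert the character string into a formant frame string
            # using the synthesis dictionary.
            frames = [synth_dictionary[ch] for ch in event["text"]]
            output.append(("formant_frames", frames))
    return output

# Minimal demonstration with toy data:
waveforms = {0: [0.0, 0.5, -0.5]}
phrases = {0: [(0, "C4"), (480, "E4")]}
dictionary = {"a": "frames_a", "i": "frames_i"}
script = [{"type": "text", "text": "ai"}, {"type": "waveform", "id": 0}]
print(reproduce_script(script, waveforms, phrases, dictionary))
```

The point of the sketch is only that all three reproduction paths (speech synthesis, waveform playback, note sequencing) are driven by one event stream, which is what lets the user combine them freely.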

Brief description of the drawings

FIG. 1 is a block diagram showing the configuration of a speech reproduction device according to a first embodiment of the present invention.

FIG. 2 is a diagram showing the assignment of phrase IDs to pronunciation units.

FIG. 3 shows an example of the content of phrase synthesis dictionary data.

FIG. 4 shows an example of the SMAF file format.

FIG. 5 is a functional block diagram showing an example of an HV authoring tool.

FIG. 6 is a block diagram showing the configuration of a portable communication terminal to which the speech reproduction device is applied.

FIG. 7 is a flowchart showing the process of creating user phrase synthesis dictionary data.

FIG. 8 is a flowchart showing the process of reproducing user phrase synthesis dictionary data.

FIG. 9 is a flowchart showing the process of creating an SMAF file.

FIG. 10 is a flowchart showing the process of reproducing an SMAF file.

FIG. 11 is a block diagram showing the configuration of a speech and music reproduction device according to a second embodiment of the present invention.

FIG. 12 is a diagram showing an example of the assignment of waveform data and music phrase data to events.

FIG. 13 is a flowchart showing the speech and music reproduction process of the second embodiment.

FIG. 14 is a block diagram showing the configuration of a mobile phone equipped with the speech and music reproduction device of the second embodiment.

FIG. 15 is a block diagram showing the configuration of a speech and music reproduction device according to a third embodiment of the present invention.

FIG. 16 is a flowchart showing the operation of the speech and music reproduction device shown in FIG. 15.

Detailed description of the preferred embodiments

Embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a speech reproduction device according to a first embodiment of the present invention.

The speech reproduction device 1 shown in FIG. 1 comprises application software 14, a middleware API (middleware application program interface) 15, a converter 16, a driver 17, default timbre parameters 18, default synthesis dictionary data 19, and a sound source 20; it receives script data 11, user timbre parameters 12, and user phrase synthesis dictionary data 13 (of variable length) as input and reproduces speech.

The speech reproduction device 1 basically reproduces speech by formant synthesis according to the CSM (composite sinusoidal modeling) speech synthesis method, using FM (frequency modulation) sound source resources. In the present embodiment, user phrase synthesis dictionary data 13 are defined, and the speech reproduction device 1 refers to them to assign user phrases to timbre parameters in phoneme units. When user phrase synthesis dictionary data 13 are assigned to timbre parameters in this way, at reproduction time the speech reproduction device 1 replaces the phonemes registered in the default synthesis dictionary data with user phrases and then synthesizes speech from the substituted data. The "phoneme" mentioned above is the smallest unit of pronunciation; in the case of Japanese it consists of vowels and consonants.
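The core idea of CSM-style formant synthesis, rendering a frame as a sum of amplitude-weighted sinusoids at the formant frequencies, can be sketched as below. This is a simplified illustration under assumed values (8 kHz sample rate, toy formant pairs); the patent's actual synthesis additionally involves pitch and the FM sound source, which are omitted here.

```python
import math

def synthesize_frame(formants, duration_s=0.02, sample_rate=8000):
    """Render one formant frame as a plain sum of amplitude-weighted
    sinusoids at the formant frequencies (a rough stand-in for CSM
    synthesis; pitch/excitation handling is omitted).
    `formants` is a list of (frequency_hz, level) pairs, up to eight."""
    n = int(duration_s * sample_rate)
    samples = []
    for i in range(n):
        t = i / sample_rate
        s = sum(level * math.sin(2 * math.pi * freq * t)
                for freq, level in formants)
        samples.append(s)
    return samples

# One 20 ms frame with two of the (up to eight) formant pairs set:
frame = synthesize_frame([(800, 1.0), (1200, 0.5)])
print(len(frame))  # 160 samples at 8 kHz
```

Successive 20 ms frames with different formant sets would then be concatenated to produce a phoneme or phrase, which matches the frame-string picture used throughout this document.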

The detailed configuration of the speech reproduction device 1 is described below.

In FIG. 1, the script data 11 define a data format for reproducing "HV (human voice): speech synthesized by the above method". That is, the script data 11 represent a data format for speech synthesis containing a pronunciation character string including intonation symbols, event data for setting the sound of a pronunciation, event data for controlling the application software 14, and so on; they take a text-input form to make manual input by the user easy.

The definition of the data format in the script data 11 is language-dependent and could be defined for various languages, but in this embodiment the definition is made for Japanese.

The user phrase synthesis dictionary data 13 and the default synthesis dictionary data 19 are created by sampling and analyzing actual human voices in pronunciation character units (for example, Japanese "あ", "い", and so on), extracting eight sets of formant frequencies, formant levels, and pitch as parameters, and storing these parameters in advance as formant frame data associated with the pronunciation character units; they thus amount to databases indexed by pronunciation character unit. The user phrase synthesis dictionary data 13 are a database built outside the middleware, and the user can register formant frame data in it arbitrarily; the registered content of the user phrase synthesis dictionary data 13 can therefore completely replace the stored content of the default synthesis dictionary data 19 via the middleware API 15. That is, the content of the default synthesis dictionary data 19 can be completely replaced with the content of the user phrase synthesis dictionary data 13. The default synthesis dictionary data 19, on the other hand, are a database built inside the middleware.
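One plausible data layout for such a dictionary entry is sketched below: each pronunciation unit maps to a list of formant frames, each holding eight (frequency, level) pairs plus a pitch value, and a user dictionary can shadow the default one. The field names and numeric values are illustrative assumptions, not taken from the patent.

```python
# Default dictionary built inside the middleware (one toy entry; real
# entries would hold many 20 ms frames per pronunciation unit).
default_dictionary = {
    "あ": [
        {"formants": [(800, 60), (1200, 55), (2500, 40), (3500, 30),
                      (4500, 20), (5500, 15), (6500, 10), (7500, 5)],
         "pitch": 120},
    ],
}

# User dictionary built outside the middleware; entries registered here
# replace the corresponding default entries at reproduction time.
user_dictionary = {
    "あ": [{"formants": [(700, 58)], "pitch": 110}],
}

def lookup(unit, use_user_dictionary):
    """Return the frame list for a unit, preferring the user dictionary
    when it is selected and has an entry for that unit."""
    if use_user_dictionary and unit in user_dictionary:
        return user_dictionary[unit]
    return default_dictionary[unit]

print(lookup("あ", True)[0]["pitch"])   # 110 (user entry wins)
print(lookup("あ", False)[0]["pitch"])  # 120 (default entry)
```

The substitution-at-lookup design mirrors the text's statement that the default dictionary content can be completely replaced by user phrase content through the middleware API.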

The user phrase synthesis dictionary data 13 and the default synthesis dictionary data 19 are each preferably provided in two types, for male voice quality and female voice quality. The speech output of the speech reproduction device 1 changes according to the period of each frame; the frame period of the formant frame data registered in the user phrase synthesis dictionary data 13 and the default synthesis dictionary data 19 is set to, for example, 20 ms.

The user timbre parameters 12 and the default timbre parameters 18 are parameter groups that control the sound quality of the speech output of the speech reproduction device 1. That is, they allow, for example, changes to the eight sets of formant frequencies and formant levels (specification of the amounts of change from the formant frequencies and levels registered in the user phrase synthesis dictionary data 13 and the default synthesis dictionary data 19) and specification of the basic waveform for formant synthesis, so that a wide variety of timbres can be produced.

The default timbre parameters 18 are timbre parameters preset in the middleware as default values; the user timbre parameters 12 are parameters that the user can create arbitrarily, are stored outside the middleware, and extend the content of the default timbre parameters 18 via the middleware API 15.

The application software 14 is software for reproducing the script data 11.

The middleware API 15 forms the interface between the application software 14 and the middleware components: the converter 16, the driver 17, the default timbre parameters 18, and the default synthesis dictionary data 19.

The converter 16 interprets the script data 11 and, using the driver 17, finally converts them into a formant frame data string formed by concatenating frame data.

The driver 17 generates a formant frame data string from the pronunciation character string contained in the script data 11 and the default synthesis dictionary data 19, and interprets the timbre parameters to process that formant frame data string.

The sound source 20 outputs a synthesized sound signal corresponding to the output data of the converter 16, and this synthesized sound signal is sent to a loudspeaker to produce sound.

The technical features of the speech reproduction device 1 of this embodiment are described below.

The user timbre parameters 12 include parameters that assign phrase IDs stored in the user phrase synthesis dictionary data 13 to arbitrary pronunciation units. FIG. 2 shows an example of the assignment of pronunciation units and phrase IDs; here, the assignment relationship between mora and phrase IDs is shown. In the case of Japanese, a mora is a "beat", corresponding, for example, to a kana character unit.

By assigning a phrase ID to each pronunciation unit, the pronunciation units specified in the user timbre parameters 12 are resolved by referring not to the default synthesis dictionary data 19 but to the user phrase synthesis dictionary data 13. In the user timbre parameters 12, it is preferable that an arbitrary number of pronunciation units can be specified from one timbre parameter.
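The phrase-ID indirection of FIG. 2 can be sketched as two table lookups. The concrete IDs and frame-data placeholders below are illustrative assumptions; only the lookup order (user timbre parameter first, default dictionary as fallback) is taken from the text.

```python
# Unit -> phrase ID mapping held in a user timbre parameter (cf. FIG. 2).
user_timbre_parameter = {"あ": 1, "い": 2, "か": 3}

# Phrase ID -> formant frame data in the user phrase synthesis dictionary.
user_phrase_dictionary = {1: "frames_konnichiwa",
                          2: "frames_ohayou",
                          3: "frames_suzuki_desu"}

# Per-unit formant frame data in the default synthesis dictionary.
default_dictionary = {"あ": "frames_a", "い": "frames_i",
                      "か": "frames_ka", "ん": "frames_n"}

def resolve(unit):
    """Units listed in the user timbre parameter are looked up in the
    user phrase dictionary; all others fall back to the default."""
    if unit in user_timbre_parameter:
        return user_phrase_dictionary[user_timbre_parameter[unit]]
    return default_dictionary[unit]

print(resolve("あ"))  # frames_konnichiwa (user phrase)
print(resolve("ん"))  # frames_n (default dictionary)
```

Because the indirection lives in the timbre parameter rather than the dictionary itself, switching timbre parameters mid-script (the "K"/"X" events below) also switches which units are redirected.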

The assignment of a phrase ID to each pronunciation unit in the user timbre parameters 12 described above is one example in this embodiment; other methods may be used as long as they correspond to pronunciation units.

Details of the user phrase synthesis dictionary data 13 are described next; FIG. 3 shows an example of their content. The user phrase synthesis dictionary data 13 store formant frame data consisting of eight sets of formant frequencies, formant levels, and pitch. A "phrase" in FIG. 3 is a unit of text that carries a meaning or is grouped by mora, such as the Japanese "おはよう"; the scale of a "phrase" need not be specially prescribed and may be a unit of any size, such as a word, a mora, or a sentence.

A tool for creating the user phrase synthesis dictionary data 13 must include an analysis tool that loads and analyzes ordinary sound files (files with extensions such as *.wav or *.aif) and generates formant frame data consisting of eight sets of formant frequencies, formant levels, and pitch.

The script data 11 contain event data indicating a change of sound quality, and these event data can be used to specify the user timbre parameters 12.

For example, as a description example of the script data 11 using Japanese hiragana and alphanumeric characters, "TJK12 みなさんX10あか" can be set. In this example, "K" represents event data specifying the default timbre parameters 18, and "X" represents event data specifying the user timbre parameters 12. "K12" is a code that selects one particular set of default timbre parameters from among several, and "X10" is a code that selects the user timbre parameters shown in FIG. 2 from among several sets of user timbre parameters.

In the above example, the reproduced synthesized speech is "みなさんこんにちは铃木です。". Here, "みなさん" is synthesized speech reproduced by referring to the default timbre parameters 18 and the default synthesis dictionary data 19, while "こんにちは" and "铃木です" are synthesized speech reproduced by referring to the user timbre parameters 12 and the user phrase synthesis dictionary data 13. That is, "みなさん" is synthesized speech for which the formant frame data of the four phonemes "み", "な", "さ", and "ん" are read out of the default synthesis dictionary data 19 and reproduced, while "こんにちは" and "铃木です" are synthesized speech for which the formant frame data of the respective phrase units are read out of the user phrase synthesis dictionary data 13 and reproduced.

Although the example of FIG. 2 shows three pronunciation units, "あ", "い", and "か", any characters and symbols that can be written as text may be used. In the above example, the phrase "こんにちは" is pronounced via the pronunciation character "あ" following "X10", and the phrase "铃木です" is pronounced via the pronunciation character "か". Therefore, if the original pronunciation of "あ" is wanted after the above example, it suffices to insert a symbol that returns the reference target to the default synthesis dictionary data 19 (for example, "X○○", with a prescribed number in place of "○○").
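A rough tokenizer for this script fragment might look as follows. The grammar here is guessed from the single example in the text ("K" plus digits selects default timbre parameters, "X" plus digits selects user timbre parameters, other characters are pronunciation text); the leading "TJ" prefix and intonation symbols are not handled, and the patent's real syntax may differ.

```python
import re

def tokenize(script):
    """Split an HV-script fragment into timbre-select events and
    pronunciation characters (hypothetical grammar, see lead-in)."""
    tokens = []
    for match in re.finditer(r"([KX])(\d+)|(\S)", script):
        code, number, char = match.groups()
        if code == "K":
            tokens.append(("default_timbre", int(number)))
        elif code == "X":
            tokens.append(("user_timbre", int(number)))
        else:
            tokens.append(("char", char))
    return tokens

print(tokenize("K12あX10か"))
# [('default_timbre', 12), ('char', 'あ'), ('user_timbre', 10), ('char', 'か')]
```

A driver would then consume these tokens in order, switching the active timbre parameter set on each "K"/"X" event and resolving each pronunciation character against the currently selected dictionary.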

The data interchange format of the music reproduction sequence data (SMAF: Synthetic Music Mobile Application Format) used by the speech reproduction device 1 of this embodiment is described next with reference to FIG. 4. FIG. 4 shows the SMAF file format; SMAF is a data interchange format for distributing and mutually using data that express music with a sound source, and is a data format specification for reproducing multimedia content on portable terminals (personal digital assistants (PDAs), personal computers, cellular phones, and so on).

The SMAF file 30 of the data interchange format shown in FIG. 4 uses a data unit called a chunk as its basic structure. A chunk consists of a fixed-length (8-byte) header and a body of arbitrary length; the header is divided into a 4-byte chunk ID and a 4-byte chunk size. The chunk ID is used as the identifier of the chunk, and the chunk size indicates the length of the body. The SMAF file 30 itself and the various data it contains are all structured as chunks.
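A minimal walker over this chunk structure is sketched below: read a 4-byte ID and 4-byte size, slice out the body, and advance. Big-endian sizes and the toy chunk IDs ("CNTI", "HVS0") are assumptions made for illustration, not values from the SMAF specification.

```python
import struct

def iter_chunks(data):
    """Yield (chunk_id, body) pairs from a buffer of 8-byte-header chunks."""
    offset = 0
    while offset + 8 <= len(data):
        # 4-byte ASCII chunk ID followed by a 4-byte (assumed big-endian) size.
        chunk_id, size = struct.unpack_from(">4sI", data, offset)
        body = data[offset + 8 : offset + 8 + size]
        yield chunk_id.decode("ascii"), body
        offset += 8 + size

# Build a toy two-chunk buffer and walk it:
buf = (b"CNTI" + struct.pack(">I", 3) + b"abc"
       + b"HVS0" + struct.pack(">I", 2) + b"hv")
print(list(iter_chunks(buf)))
# [('CNTI', b'abc'), ('HVS0', b'hv')]
```

Because every size is stored in the header, a reader can skip chunks it does not understand, which is what makes the optional chunks described below (optional data chunk, HV setup data chunk, and so on) harmless to older readers.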

As shown in FIG. 4, the SMAF file 30 consists of a contents info chunk 31, an optional data chunk 32, a score track chunk 33, and an HV chunk 36.

The contents info chunk 31 stores various management information about the SMAF file 30, for example, the class and type of the stored content, copyright information, genre name, song title, artist name, and lyricist/composer name. The optional data chunk 32 stores information such as copyright information, genre name, song title, artist name, and lyricist/composer name. The optional data chunk 32 need not be provided in the SMAF file 30.

The score track chunk 33 is a chunk that stores the sequence track of the musical piece to be sent to the sound source, and contains a setup data chunk 34 (optional) and a sequence data chunk 35.

The setup data chunk 34 is a chunk that stores tone color data and the like for the sound source, and also stores exclusive messages; the exclusive messages include, for example, tone parameter registration information.

The sequence data chunk 35 is a chunk that stores the actual performance data; HV note-on events ('HV' standing for human voice), which determine the playback timing of the script data 11, are stored intermixed with the other sequence events. HV events are distinguished from other music events by the channel designated for HV.

The HV chunk 36 contains an HV setup data chunk 37 (optional), an HV user phrase dictionary chunk 38 (optional), and an HV-S chunk 39.

The HV setup data chunk 37 stores the HV user tone parameters and information designating the channel used for HV, and the HV-S chunk 39 stores the HV script data.

The HV user phrase dictionary chunk 38 stores the contents of the user phrase synthesis dictionary data 13; the HV user tone parameters stored in the HV setup data chunk 37 must include parameters for assigning the syllables and phrase IDs shown in Fig. 2.

By using the SMAF file 30 shown in Fig. 4 with the tone parameters of this embodiment, synthesized speech (HV) can be played back in synchronization with the musical piece, and the contents of the user phrase synthesis dictionary data 13 can be played back as well.

Next, the HV authoring tool, which is the tool used to create the user phrase synthesis dictionary data 13 shown in Fig. 1 and the SMAF file 30 shown in Fig. 4, will be described with reference to Fig. 5. Fig. 5 is a block diagram showing an example of the functions and specifications of the HV authoring tool.

When creating the SMAF file 30, the HV authoring tool 42 reads in an SMF file (Standard MIDI File) 41 created in advance by a MIDI (Musical Instrument Digital Interface) sequencer (including the note-on events that determine the sounding timing of HV), and converts it into an SMAF file 43 (corresponding to the aforementioned SMAF file 30) based on information obtained from the HV script UI (HV script user interface) 44 and the HV voice editor 45.

The HV voice editor 45 is an editor having the function of editing the HV user tone parameters (corresponding to the aforementioned user tone parameters 12) contained in the HV user tone file 48. In addition to editing various HV tone parameters, the HV voice editor 45 can assign user phrases to arbitrary syllables.

The interface of the HV voice editor 45 has a menu for selecting syllables, together with a function for assigning an arbitrary sound file 50 to the selected syllable. A sound file 50 assigned through the interface of the HV voice editor 45 is analyzed by the waveform analyzer 46, thereby generating formant frame data consisting of eight sets of resonance frequency, formant strength, and pitch. The formant frame data can be input and output as individual files (i.e., the HV user tone file 48 and the HV user synthesis dictionary file 49).
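The formant frame data described above — eight sets of resonance frequency, formant strength, and pitch per frame — could be modeled as follows. Field names, units, and types are assumptions of this sketch, not the device's actual data layout.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FormantSet:
    # One of the eight parameter sets in a frame; units (Hz) are assumed.
    frequency_hz: float
    strength: float
    pitch_hz: float

@dataclass
class FormantFrame:
    sets: Tuple[FormantSet, ...]

    def __post_init__(self):
        # The text specifies exactly eight sets per frame.
        if len(self.sets) != 8:
            raise ValueError("a frame carries exactly eight formant sets")

# A phrase is a time series of frames; the frame rate trades data
# size against playback quality, as noted later in the text.
frame = FormantFrame(tuple(FormantSet(500.0 * (i + 1), 1.0, 120.0) for i in range(8)))
phrase: List[FormantFrame] = [frame]
print(len(phrase[0].sets))  # 8
```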

The HV script UI 44 can directly edit the HV script data, and the HV script data can also be input and output as an individual file (i.e., the HV script file 47). The HV authoring tool 40 according to this embodiment may also consist only of the above-described HV authoring tool 42, HV script UI 44, HV voice editor 45, and waveform analyzer 46.

Next, an example in which the speech playback device 1 of this embodiment is applied to a portable communication terminal will be described with reference to Fig. 6. Fig. 6 is a block diagram showing the configuration of a portable communication terminal 60 equipped with the speech playback device 1.

The portable communication terminal 60 is, for example, a device equivalent to a mobile phone, and is provided with a CPU 61, a ROM 62, a RAM 63, a display unit 64, a vibrator 65, an input unit 66, a communication unit 67, an antenna 68, a speech processing unit 69, a sound source 70, a speaker 71, and a bus 72. The CPU 61 performs overall control of the portable communication terminal 60; the ROM 62 stores control programs such as various communication control programs and programs for playing back musical pieces, together with various constant data and the like.

The RAM 63 is used as a work area and also stores music files and various application programs. The display unit 64 is composed of, for example, a liquid crystal display (LCD), and the vibrator 65 vibrates when the mobile phone receives an incoming call. The input unit 66 is composed of operating elements such as a plurality of keys; through these operating elements, user operations instruct the registration processing of user tone parameters, user phrase synthesis dictionary data, and HV script data. The communication unit 67 is composed of a modem or the like and is connected to the antenna 68.

The speech processing unit 69 is connected to a transmitter and a receiver speaker (e.g., microphone and earphone) and has the function of encoding and decoding speech signals for telephone calls. The sound source 70 plays back musical pieces based on music files stored in the RAM 63 or the like, and also plays back speech signals, outputting them to the speaker 71. The bus 72 is a transfer path for data transfer between the components of the mobile phone, namely the CPU 61, ROM 62, RAM 63, display unit 64, vibrator 65, input unit 66, communication unit 67, speech processing unit 69, and sound source 70.

The communication unit 67 can download HV script files or the SMAF file 30 shown in Fig. 4 from a prescribed contents server or the like and store them in the RAM 63. The ROM 62 stores the application program 14 of the speech playback device 1 shown in Fig. 1 and the middleware program. The CPU 61 reads out and starts the application program 14 and the middleware program. The CPU 61 interprets the HV script data stored in the RAM 63, generates formant frame data, and sends the formant frame data to the sound source 70.

Next, the operation of the speech playback device 1 of this embodiment will be described. First, the method of creating the user phrase synthesis dictionary 13 will be described. Fig. 7 is a flowchart showing the method of creating the user phrase synthesis dictionary 13.

First, in step S1, the HV tone that refers to the user phrase synthesis dictionary 13 is selected with the HV authoring tool 42 shown in Fig. 5, and the HV voice editor 45 is started. The syllables to be used are then selected with the HV voice editor 45, and sound files are loaded. In step S2, the HV voice editor 45 then generates and outputs user phrase dictionary data (corresponding to the HV user synthesis dictionary file 49).

Next, the HV tone parameters are edited with the HV voice editor 45; then, in step S3, the HV voice editor 45 generates and outputs user tone parameters (corresponding to the user tone file 48).

Then, using the HV script UI 44, a voice quality change event designating the corresponding HV tone is described in the HV script data, thereby describing the syllables to be played back. Next, in step S4, the HV script UI 44 generates and outputs HV data (corresponding to the HV script file 47).

Next, the playback operation for the user phrase synthesis dictionary data 13 in the speech playback device 1 will be described with reference to Fig. 8. Fig. 8 is a flowchart showing the playback operation for the user phrase synthesis dictionary data 13 in the speech playback device 1.

First, in step S11, the user tone parameters 12 and the user phrase synthesis dictionary data 13 are registered in the middleware of the speech playback device 1. Then the script data 11 is registered in the middleware of the speech playback device 1, and in step S12 playback of the HV script data is started.

During playback, in step S13, it is monitored whether a voice quality change event (X event) designating the user tone parameters 12 is contained in the script data 11.

If a voice quality change event is found in step S13, the phrase ID assigned to the syllable is looked up in the user tone parameters 12 and the data corresponding to that phrase ID is read out from the user phrase synthesis dictionary data 13; then, in step S14, the dictionary data for the corresponding syllable in the default synthesis dictionary data 19 managed by the HV driver is replaced with the user phrase synthesis dictionary data 13. The replacement processing of step S14 may also be performed before the HV script data is played back.

After step S14 is completed, or when no voice quality change event is found in step S13, the flow proceeds to step S15, where the converter 16 interprets the syllables of the script data 11 (or, when the processing of step S14 has been performed, of the script data after the replacement processing of step S14), and the HV driver finally converts them into formant frame string data.

In step S16, the data converted in step S15 is played back through the sound source 20.

Thereafter, the flow proceeds to step S17, where it is determined whether playback of the script data 11 has ended. If it has not ended, the flow returns to step S13; if it has ended, the playback processing of the user phrase synthesis dictionary data 13 shown in Fig. 8 ends.
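Steps S13–S14 of the flow above amount to swapping dictionary entries, keyed by syllable, whenever a voice quality change event names a user tone. A minimal sketch, with the event representation and data shapes assumed purely for illustration:

```python
def play_script(events, default_dict, user_tone_params, user_phrase_dict):
    """Walk script events in the spirit of Fig. 8, steps S13-S15.

    events: a list of ("X", tone_id) voice-quality-change events or
            ("syllable", text) pronunciation events (assumed format).
    Returns the dictionary entries that would then be converted into
    formant frame string data and sent to the sound source.
    """
    active = dict(default_dict)  # working copy of the default dictionary
    out = []
    for kind, value in events:
        if kind == "X":  # X event: swap in the user phrases for this tone
            for syllable, phrase_id in user_tone_params.get(value, {}).items():
                active[syllable] = user_phrase_dict[phrase_id]
        else:  # pronounce via whatever entry is currently active
            out.append(active[value])
    return out

default_dict = {"あ": "default-あ", "か": "default-か"}
user_tone = {12: {"あ": "P1"}}            # tone 12 maps syllable "あ" to phrase P1
user_phrases = {"P1": "user-phrase-こんにちは"}
events = [("syllable", "か"), ("X", 12), ("syllable", "あ")]
print(play_script(events, default_dict, user_tone, user_phrases))
# ['default-か', 'user-phrase-こんにちは']
```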

Next, the method of creating the SMAF file 30 shown in Fig. 4 will be described with reference to Fig. 9. Fig. 9 is a flowchart showing the method of creating the SMAF file 30.

First, following the procedure shown in Fig. 7, the user phrase synthesis dictionary data 13, the user tone parameters 12, and the script data 11 are created (see step S21).

Then, in step S22, an SMF file 41 is created containing events that control the sounding of the music data and the HV script data.

Next, the SMF file 41 is input to the HV authoring tool 42 shown in Fig. 5, and the HV authoring tool 42 converts the SMF file 41 into an SMAF file 43 (corresponding to the aforementioned SMAF file 30) (see step S23).

Then, the user tone parameters 12 created in step S21 are entered into the HV setup data chunk 37 within the HV chunk 36 of the SMAF file 30 shown in Fig. 4, and the user phrase synthesis dictionary data 13 created in step S21 is entered into the HV user phrase dictionary chunk 38 within the HV chunk 36 of the SMAF file 30; in this way the SMAF file 30 is generated and output (see step S24).
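The packaging performed in step S24 can be sketched with the same 8-byte header convention described for Fig. 4. The chunk IDs and byte order used here are placeholders, not the actual SMAF identifiers:

```python
import struct

def chunk(chunk_id: bytes, body: bytes) -> bytes:
    """Prefix a body with the 8-byte header: 4-byte ID + 4-byte size."""
    assert len(chunk_id) == 4
    return chunk_id + struct.pack(">I", len(body)) + body

def build_hv_chunk(setup: bytes, phrase_dict: bytes, script: bytes) -> bytes:
    # HV setup data chunk (37), HV user phrase dictionary chunk (38),
    # and HV-S chunk (39), nested inside one HV chunk (36).
    body = chunk(b"HVSU", setup) + chunk(b"HVPD", phrase_dict) + chunk(b"HVS ", script)
    return chunk(b"HV  ", body)

hv = build_hv_chunk(b"tone", b"dict", b"TJK12...")
print(hv[:4], len(hv))  # chunk ID of the outer HV chunk, total size
```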

Next, the playback processing of the SMAF file 30 will be described with reference to Fig. 10. Fig. 10 is a flowchart of the playback processing of the SMAF file 30.

First, in step S31, the SMAF file 30 is registered in the middleware of the speech playback device 1 shown in Fig. 1. Here, the speech playback device 1 normally registers the music data portion of the SMAF file 30 in the music playback unit of the middleware and prepares for playback.

In step S32, the speech playback device 1 determines whether the HV chunk 36 is contained in the SMAF file 30.

If the determination result in step S32 is "Yes", the flow proceeds to step S33, where the speech playback device 1 interprets the contents of the HV chunk 36.

In step S34, the speech playback device 1 registers the user tone parameters, the user phrase dictionary data, and the HV script data.

If the determination result in step S32 is "No", or when the registration processing in step S34 has been completed, the flow proceeds to step S35, where the speech playback device 1 interprets the chunks in the music playback unit.

Then, in response to a "start" signal, the speech playback device 1 starts interpreting the sequence data (i.e., the actual performance data) in the sequence data chunk 35, thereby playing back the musical piece (see step S36).

During the music playback described above, the speech playback device 1 interprets the events contained in the sequence data in order, and in the process determines whether each event corresponds to an HV note-on (see step S37).

If the determination result in step S37 is "Yes", the flow proceeds to step S38, where the speech playback device 1 starts playing back the HV script data of the HV chunk designated by the HV note-on.

After step S38, the speech playback device 1 performs the playback processing of the user phrase synthesis dictionary data shown in Fig. 8; that is, during playback of the HV script data in step S38, the speech playback device 1 determines whether a voice quality change event (X event) designating the user tone parameters 12 is present (see step S39).

If such a voice quality change event is present, i.e., if the determination result in step S39 is "Yes", the flow proceeds to step S40, where the phrase ID assigned to the syllable is looked up in the user tone parameters 12, the data corresponding to the phrase ID is read out from the user phrase synthesis dictionary data 13, and the dictionary data for the corresponding syllable in the default synthesis dictionary data 19 managed by the HV driver is replaced with the user phrase synthesis dictionary data. The replacement processing of step S40 may also be performed before the HV script data is played back.

After step S40 is completed, or when no voice quality change event is found in step S39, the flow proceeds to step S41, where the converter 16 interprets the syllables of the script data 11 and the HV driver finally converts them into formant frame string data.

The flow then proceeds to step S42, where the data converted in step S41 is played back by the HV playback unit of the sound source 20.

Thereafter, the flow proceeds to step S43, where the speech playback device 1 determines whether playback of the musical piece has ended. If music playback has ended, the playback processing of the SMAF file 30 ends; if it has not, the flow returns to step S37.

If, in step S37, the event in the sequence data is not an HV note-on, the speech playback device 1 regards the event as part of the music data and converts it into sound source playback event data (see step S44).

The flow then proceeds to step S45, where the speech playback device 1 plays back the data converted in step S44 through the music playback unit of the sound source 20.
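The branch at steps S37 and S44 is a per-event dispatch: an HV note-on on the channel designated for HV starts HV script playback, and every other event goes to the music playback path. A schematic sketch, with the event representation assumed:

```python
def play_sequence(sequence_events, hv_channel, start_hv_script, play_music_event):
    """Dispatch sequence events (Fig. 10, steps S37, S38, S44, S45).

    Events are (channel, kind, payload) tuples in this sketch; a
    note-on on the channel designated for HV triggers HV script
    playback, and every other event is treated as music data.
    """
    for channel, kind, payload in sequence_events:
        if kind == "note-on" and channel == hv_channel:
            start_hv_script(payload)      # step S38
        else:
            play_music_event(payload)     # steps S44-S45

log = []
play_sequence(
    [(0, "note-on", "C4"), (15, "note-on", "hv-block-1"), (0, "note-off", "C4")],
    hv_channel=15,
    start_hv_script=lambda p: log.append(("hv", p)),
    play_music_event=lambda p: log.append(("music", p)),
)
print(log)  # [('music', 'C4'), ('hv', 'hv-block-1'), ('music', 'C4')]
```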

As described above, this embodiment adopts a speech playback method based on formant synthesis using an FM sound source, which offers the following three advantages.

(1) Phrases preferred by the user can be assigned; that is, speech can be played back with a timbre closer to the preferred one without depending on a fixed dictionary.

(2) Since only a part of the default synthesis dictionary data 19 is replaced with the user phrase synthesis dictionary data 13, an excessive increase in data capacity in the speech playback device 1 can be avoided. Since a part of the default synthesis dictionary data 19 can be replaced with arbitrary phrases, pronunciation in phrase units can be realized, and the auditory dissonance arising at the junctions between pronunciation units in conventional synthesized speech generated per pronunciation unit can be eliminated.

(3) Since arbitrary phrases can be designated in the HV script data, speech synthesis in syllable units and speech pronunciation in phrase units can be used together.

Furthermore, according to this embodiment, timbre changes according to formant strength can be realized, in contrast to the method of playing back waveform data obtained by sampling phrases in advance. Although both the data size and the quality in this embodiment depend on the frame rate, high-quality speech playback can be achieved with far less data capacity than the conventional method based on sampled waveform data. The speech playback device 1 of this embodiment can therefore easily be incorporated into a portable communication terminal such as a mobile phone, so that contents such as e-mail can be played back as high-quality speech.

Fig. 11 is a block diagram showing the configuration of a speech and music playback device according to a second embodiment of the present invention. Here, an HV script (i.e., HV script data) corresponds to a file that defines the format used for playing back speech: it is a file of speech synthesis data consisting of pronunciation character strings containing prosodic symbols (i.e., symbols designating pronunciation forms such as pitch), settings for the sounds to be produced, and information for the playback application, and it is created by text input so as to make creation by the user easier.

The HV script is read in by application software such as a text editor, and may be described in any file format that can be edited as text; one example is a text file created with a text editor. The HV script is language-dependent and can be defined in various languages; in this embodiment, the HV script is defined in Japanese.

Reference numeral 101 denotes an HV script player, which controls the playback, stopping, and so on of HV scripts. When an HV script is registered in the HV script player 101 and a playback instruction for it is received, the HV script player 101 starts interpreting the HV script. Then, according to the type of each event described in the HV script, one of the HV driver 102, the waveform playback player 104, and the phrase playback player 107 performs processing corresponding to that event.

The HV driver 102 reads out and refers to synthesis dictionary data from a ROM (read-only memory), not shown. Human speech has prescribed formants (i.e., a characteristic spectrum) that depend on the structure of the human body (for example, the shape of the vocal cords, the oral cavity, and so on); the synthesis dictionary data stores parameters concerning the formants of speech in correspondence with pronunciation characters. The synthesis dictionary data corresponds to a database in which parameters, obtained by sampling and analyzing actual sounds per pronunciation character unit (for example, phoneme units such as the Japanese "あ" and "い"), are stored in advance as formant frame data, per pronunciation character unit.

For example, in the case of the aforementioned CSM (Composite Sinusoidal Modeling) speech synthesis method, the synthesis dictionary data stores eight sets of resonance frequency, formant strength, pitch, and so on as parameters. Compared with a playback method based on waveform data obtained by sampling speech, such a speech synthesis method has the advantage that the amount of data is very small. Moreover, the synthesis dictionary data can also store parameters that control the sound quality of the played-back speech (for example, parameters for designating changes to the eight sets of resonance frequencies and formant strengths).

The HV driver 102 interprets the pronunciation character strings, including the prosodic symbols, contained in the HV script and converts them into a formant frame string using the synthesis dictionary data, then outputs it to the HV sound source 103. The HV sound source 103 generates a sounding signal from the formant frame string output by the HV driver 102 and outputs it to the adder 110.

The waveform playback player 104 plays back and stops waveform data of pre-sampled speech, music, imitative sounds, and the like. Reference numeral 105 denotes a waveform data RAM (waveform data random-access memory), in which default waveform data is stored in advance. Via a registration API (registration application program interface) 113, the user can store user waveform data held in the user data RAM 112 into the waveform data RAM 105. When the waveform playback player 104 receives a playback instruction from the HV script player 101, it reads the waveform data from the waveform data RAM 105 and outputs it to the waveform regenerator 106. The waveform regenerator 106 generates a sounding signal from the waveform data output by the waveform playback player 104 and outputs it to the adder 110. The sampled waveform data is not limited to the PCM (pulse-code modulation) format; a speech compression format such as MP3 (Moving Picture Experts Group Layer 3) may also be adopted.

The phrase playback player 107 plays back and stops music phrase data (or music data). The music phrase data is in SMF format and consists of note information, representing the pitch, volume, and so on of the sounds to be produced, and time information, representing the sounding times of those sounds. Reference numeral 108 denotes a music phrase data RAM, in which default music phrase data is stored in advance. Via the registration API, the user can store user music phrase data held in the user data RAM 112 into the music phrase data RAM 108.

When the phrase playback player 107 receives a playback instruction from the HV script player 101, it reads the music phrase data from the music phrase data RAM 108, performs time management of the note information in the music phrase data, and outputs the note information to the phrase sound source 109 according to the time information described in the music phrase data. The phrase sound source 109 generates a musical tone signal based on the note information output by the phrase playback player 107 and outputs it to the adder 110. An FM method, a PCM method, or the like may be adopted for the phrase sound source 109, but the sound source method need not be limited as long as it provides a music phrase data playback function.

The adder 110 combines the sounding signal output from the HV sound source 103, the speech signal output from the waveform regenerator 106, and the musical tone signal output from the phrase sound source 109, and outputs the combined signal to the speaker 111. The speaker 111 emits speech and/or musical tones according to the combined signal from the adder 110.
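The adder 110 simply sums the three signal paths into one output signal. A minimal PCM-style sketch; the 16-bit sample format and the clipping behavior are assumptions, since the text does not specify them:

```python
def mix(hv, wave, phrase):
    """Sum three equal-length sample streams, clipping to the 16-bit range."""
    out = []
    for a, b, c in zip(hv, wave, phrase):
        s = a + b + c
        out.append(max(-32768, min(32767, s)))
    return out

# Second sample overflows 16 bits and is clipped.
print(mix([1000, 30000], [2000, 10000], [-500, 0]))  # [2500, 32767]
```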

Processing may be performed simultaneously in the HV driver 102, the waveform playback player 104, and the phrase playback player 107, so that the speech and music based respectively on the sounding signal, the speech signal, and the musical tone signal (the speech signal and the musical tone signal may also be collectively called "sound signals") are sounded at the same time. Alternatively, the HV script player 101 may manage the processing timing of the HV driver 102, the waveform playback player 104, and the phrase playback player 107 and simultaneously play back the speech and music resulting from their respective processing. In this embodiment, simultaneous processing by the HV driver 102, the waveform playback player 104, and the phrase playback player 107 is prohibited. In Fig. 11, for convenience of explanation, separate RAMs are provided as the waveform data RAM 105, the music phrase data RAM 108, and the user data RAM 112, but these functions may also be assigned to different storage areas within a single RAM.

Fig. 12 shows a definition example of events, described in an HV script, for playing back waveform data or music phrase data (hereinafter collectively called "sound data"). An event whose initial character is "D" is a default definition, and one whose initial character is "○" is a user definition. Each event is categorized as either waveform or phrase. Default waveform data pre-stored in the waveform data RAM 105 or default music phrase data pre-stored in the music phrase data RAM 108 is assigned to the default definitions (D0 to D63); 64 items of default waveform data or default music phrase data can be assigned to the default definitions. Sampled waveform data or music phrase data created arbitrarily by the user is assigned to the user definitions (○0 to ○63); 64 items of sampled waveform data or music phrase data can be assigned to the user definitions.

For the events shown in FIG. 12 whose category is waveform data, data indicating the association between each event and the waveform data it represents is stored in advance in the waveform-data RAM 105. Likewise, for the events whose category is phrase, data indicating the association between each event and the music phrase data it represents is stored in advance in the music-phrase-data RAM 108. When the user has registered waveform data or music phrase data in the user-data RAM 112, these data are updated accordingly.

An HV-script is described, for example, as "TJK12みなさん○0です。D20". In "TJK12" at the beginning, "T" is a symbol indicating the start of the HV-script, and "J" designates the national character code; here it indicates that the HV-script is described in Japanese. "K12" is a symbol that sets the voice quality, designating the twelfth voice quality. "みなさん" and "です" are interpreted by the HV driver 102, and Japanese speech such as "みなさん" and "です" is emitted from the speaker 111. When a pronunciation character string such as "みなさん" or "です" contains prosodic symbols expressing the pronunciation state, such as intonation (or stress), speech with the corresponding intonation (or stress) added is emitted.
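The header fields just described ("T", a one-letter language code, and a "K" voice-quality number) can be separated with a short parser. This is an illustrative sketch of that reading of the header, not the patent's actual grammar; the function and field names are invented:

```python
import re

def parse_hv_header(script):
    """Split an HV-script such as 'TJK12...' into header fields and body.

    'T'   marks the start of the script,
    'J'   is the national character code (J = Japanese),
    'K12' selects voice quality number 12.
    """
    m = re.match(r"T(?P<lang>[A-Z])K(?P<quality>\d+)", script)
    if m is None:
        raise ValueError("not a valid HV-script header")
    return {
        "language": m.group("lang"),
        "voice_quality": int(m.group("quality")),
        "body": script[m.end():],   # pronunciation characters and events
    }

header = parse_hv_header("TJK12みなさん○0です。D20")
print(header["language"], header["voice_quality"])  # J 12
```

The remaining body string is what the HV-script player goes on to interpret character by character.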

In the user event "○0", for example, waveform data obtained by sampling a voice uttering "鈴木" (Suzuki) is registered. This user event "○0" is interpreted by the waveform reproduction player 104, whereby the voice "Suzuki" is emitted from the speaker 111. In the user event "○20", for example, a short, cheerful piece of music phrase data is registered. This user event "○20" is interpreted by the phrase reproduction player 107, whereby cheerful music is emitted from the speaker 111. At this time, the reproduced speech becomes "みなさん鈴木です" (while the music phrase is being reproduced), and only the "Suzuki" portion is reproduced from waveform data. Speech produced by reproducing waveform data sounds more natural at the junctions between pronunciation units than speech synthesized from pronunciation units such as "みなさん" or "です". In addition, reproducing a waveform that features the pronunciation of the word "Suzuki" lets the user hear the reproduced speech effectively. As described above, by describing in the HV-script events that designate the reproduction of waveform data or music phrase data, the reproduction timing of the waveform data or music phrase data can be designated arbitrarily. The notation used for the HV-script is a matter of design and is not limited to the method described above.

Next, the operation of the speech/music reproduction device according to the present embodiment will be described with reference to the flowchart of FIG. 13. First, the user creates an HV-script with a text editor and registers it in the HV-script player 101 (see step S101). At this time, if user-defined waveform data or music phrase data exists, the registration API 113 reads the waveform data or music phrase data from the user-data RAM 112. The registration API 113 stores the waveform data in the waveform-data RAM 105 and the music phrase data in the music-phrase-data RAM 108.

When the user issues a start instruction (step S103), the HV-script player 101 starts interpreting the HV-script (see step S102). The HV-script player 101 determines whether the HV-script contains an event beginning with "D" or "○" (step S104); when such an event is input, it determines whether the category of the event is waveform data (step S105). If the category of the event is waveform data, the HV-script player 101 instructs the waveform reproduction player 104 to process it, and the waveform reproduction player 104 reads the waveform data of the number following "D" or "○" from the waveform-data RAM 105 and outputs it to the waveform reproducer 106 (step S106). The waveform reproducer 106 generates a voice signal from the waveform data and outputs it to the speaker 111 via the adder 110 (step S107). In this way, the speaker 111 emits the corresponding voice.

If the category of the event is not waveform data in step S105, the flow proceeds to step S108, where the HV-script player 101 determines whether the category of the event is music phrase data. If the category of the event is music phrase data, the HV-script player 101 instructs the phrase reproduction player 107 to process it. The phrase reproduction player 107 reads the music phrase data of the number following "D" or "○" from the music-phrase-data RAM 108 and, according to the time information in the music phrase data, outputs the note information in the music phrase data to the phrase sound source 109 (see step S109). The phrase sound source 109 generates a musical-tone signal from the note information and outputs it to the speaker 111 via the adder 110 (step S110). In this way, the speaker 111 emits music. If the category of the event is determined in step S108 not to be music phrase data, the speech/music reproduction device of the present embodiment regards it as an event that cannot be processed, and the flow proceeds to step S113.
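The phrase playback in steps S109 to S110 (emitting each piece of note information at the moment given by the time information) amounts to a simple sequencer loop. A minimal sketch, assuming the phrase data is a list of (time, note) pairs; these field names and the callback are invented for illustration:

```python
import time

def play_phrase(phrase, emit, sleep=time.sleep):
    """Send each note's information to the sound source at its stated time.

    phrase -- list of (time_offset_seconds, note_info) pairs (invented layout)
    emit   -- callback standing in for the phrase sound source 109
    sleep  -- injectable wait function so the sketch can be tested instantly
    """
    clock = 0.0
    for t, note in sorted(phrase):
        if t > clock:
            sleep(t - clock)   # honor the time information in the phrase data
            clock = t
        emit(note)             # note information: pitch, volume, etc.

played = []
play_phrase([(0.0, "C4"), (0.5, "E4"), (1.0, "G4")],
            played.append, sleep=lambda d: None)  # no real waiting in this demo
print(played)  # ['C4', 'E4', 'G4']
```

In the device itself the emit step corresponds to writing note information to the phrase sound source 109, which turns it into a musical-tone signal.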

If no event beginning with "D" or "○" is described in the HV-script in step S104, the HV-script player 101 instructs the HV driver 102 to process it. The HV driver 102 converts the character string into a formant frame string using the synthesis dictionary data and outputs it to the HV sound source 103 (see step S111). The HV sound source 103 generates a synthesized-speech signal from the formant frame string and outputs it to the speaker 111 via the adder 110 (see step S112). In this way, the speaker 111 emits the corresponding speech.

Each time the processing of an event ends, the HV-script player 101 determines whether interpretation has been completed up to the last description of the HV-script (step S113). If descriptions remaining to be interpreted exist, the flow returns to step S104; if all descriptions of the HV-script have been interpreted, the speech/music reproduction processing shown in FIG. 13 ends.
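The control flow of steps S104 to S113 can be condensed into one dispatch loop: plain characters go to speech synthesis, "D"/"○" events go to the waveform or phrase player by category, and unknown events are skipped. The sketch below is a hypothetical rendering of that flow; the callbacks stand in for blocks 102, 104, and 107 and are assumptions, not the patent's implementation:

```python
import re

EVENT_RE = re.compile(r"[D○]\d+")

def interpret(body, category_of, play_waveform, play_phrase, synthesize):
    """Walk an HV-script body, dispatching as in steps S104-S113.

    body         -- script text after the header
    category_of  -- maps an event token to 'waveform' or 'phrase' (or None)
    play_waveform, play_phrase, synthesize -- playback callbacks
    """
    pos = 0
    log = []
    for m in EVENT_RE.finditer(body):
        text = body[pos:m.start()]
        if text:                      # plain characters: speech synthesis (S111-S112)
            log.append(synthesize(text))
        token = m.group()
        cat = category_of(token)
        if cat == "waveform":         # S105 true -> S106-S107
            log.append(play_waveform(token))
        elif cat == "phrase":         # S108 true -> S109-S110
            log.append(play_phrase(token))
        # any other category: unprocessable event, fall through toward S113
        pos = m.end()
    if pos < len(body):               # trailing plain text
        log.append(synthesize(body[pos:]))
    return log

cats = {"○0": "waveform", "D20": "phrase"}
log = interpret("みなさん○0です。D20", cats.get,
                lambda t: "wave:" + t, lambda t: "phrase:" + t,
                lambda s: "tts:" + s)
print(log)  # ['tts:みなさん', 'wave:○0', 'tts:です。', 'phrase:D20']
```

The sample run mirrors the worked example "TJK12みなさん○0です。D20" from the text, minus the header.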

In the case of "TJK12みなさん○0です。D20" shown as a description example of the HV-script of the present embodiment, the speech of the following word "です" must be emitted after the sounding of the waveform data defined by the event "○0" has ended. For example, when the HV-script player 101 interprets an event of waveform data (or music phrase data), it temporarily suspends reproduction of the next event, and when the sounding performed by the waveform reproduction player 104 (or the phrase reproduction player 107) ends, a signal indicating the end of sounding is output from that player to the HV-script player 101.
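The hand-off just described (suspend interpretation of the next event, then resume on an end-of-sounding signal) is the classic wait/notify pattern. A minimal sketch assuming a threaded playback player; none of these class or method names come from the patent:

```python
import threading

class PlaybackGate:
    """Blocks the script interpreter until the player signals end of sounding."""

    def __init__(self):
        self._done = threading.Event()

    def start_playback(self, play, data):
        self._done.clear()
        def run():
            play(data)           # e.g. waveform or music phrase reproduction
            self._done.set()     # the 'end of sounding' signal back to the player
        threading.Thread(target=run).start()

    def wait_until_done(self):
        self._done.wait()        # interpreter suspends the next event here

order = []
gate = PlaybackGate()
gate.start_playback(lambda d: order.append("played:" + d), "○0")
gate.wait_until_done()
order.append("resume:です")     # only after the waveform has finished sounding
print(order)  # ['played:○0', 'resume:です']
```

Because `wait_until_done` cannot return before `play` has run and set the event, "です" is always reached after the "○0" waveform finishes, matching the sequencing requirement in the text.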

When the HV driver 102, the waveform reproduction player 104, and the phrase reproduction player 107 are allowed to perform reproduction processing simultaneously, their reproduction processing can also be controlled through the description of the HV-script. For example, when the HV-script describes "TJK12みなさん○0 3です。D20", the " " (space) and "3" following "○0" express an event that sets a prescribed silent period, so that the speech reproduced by the HV driver 102 is controlled to be silent while the speech of the word "Suzuki" indicated by "○0" is being emitted. When the HV-script describes "TJK12こんにちは。D20みなさん○03です。", the music designated by "D20" and the speech "みなさん鈴木です" are sounded simultaneously.

FIG. 14 is a block diagram showing the configuration of a cellular phone equipped with the speech/music reproduction device of the present embodiment. Here, reference numeral 141 denotes a CPU that controls each part of the cellular phone. Reference numeral 142 denotes an antenna for transmitting and receiving data. Reference numeral 143 denotes a communication unit, which modulates data for transmission and outputs it to the antenna 142, and demodulates the received data picked up by the antenna. Reference numeral 144 denotes a speech processing unit, which during a call converts the speech data of the other party output from the communication unit 143 into a speech signal and outputs it to an earpiece speaker (or earphone, not shown), and converts a speech signal input from a microphone (not shown) into speech data and outputs it to the communication unit 143.

Reference numeral 145 denotes a sound source, which has the same functions as the HV sound source 103, the waveform reproducer 106, and the phrase sound source 109 shown in FIG. 11. Reference numeral 146 denotes a speaker, which emits the desired speech or musical tones. Reference numeral 147 denotes an operation unit operated by the user. Reference numeral 148 denotes a RAM that stores HV-scripts, user-defined waveform data, music phrase data, and the like. Reference numeral 149 denotes a ROM that stores the programs executed by the CPU 141 as well as the synthesis dictionary data, the default waveform data, the default music phrase data, and the like. Reference numeral 150 denotes a display unit, which displays the results of the user's operations, the state of the cellular phone, and the like on its screen. Reference numeral 151 denotes a vibrator, which vibrates upon receiving an instruction from the CPU 141 when the cellular phone receives an incoming call. The functional blocks described above are interconnected via a bus B.

This cellular phone has a function of generating waveform data from speech: speech input from the microphone is sent to the speech processing unit 144 and converted into waveform data, and the waveform data is stored in the RAM 148. When music phrase data is downloaded from a web server via the communication unit 143, the music phrase data is stored in the RAM 148.

In accordance with the programs stored in the ROM 149, the CPU 141 performs the same processing as the HV-script player 101, the HV driver 102, the waveform reproduction player 104, and the phrase reproduction player 107 shown in FIG. 11. The CPU 141 also interprets the events described in the HV-script read from the RAM 148. When an event indicates sounding by speech synthesis, the CPU 141 reads and refers to the synthesis dictionary data in the ROM 149, converts the character string described in the HV-script into a formant frame string, and outputs it to the sound source 145.

When an event indicates reproduction of waveform data, the CPU 141 reads the waveform data of the number following "D" or "○" in the HV-script from the RAM 148 or the ROM 149 and outputs it to the sound source 145. When an event indicates reproduction of a music phrase, the CPU 141 reads the music phrase data of the number following "D" or "○" in the HV-script from the RAM 148 or the ROM 149 and, according to the time information in the music phrase data, outputs the note information in the music phrase data to the sound source 145.

The sound source 145 generates a synthesized-speech signal from the formant frame string output from the CPU 141 and outputs it to the speaker 146. It also generates a voice signal from the waveform data output from the CPU 141 and outputs it to the speaker 146. It further generates a musical-tone signal from the music phrase data output from the CPU 141 and outputs it to the speaker 146. The speaker 146 emits speech or musical tones as appropriate according to the synthesized-speech signal, the voice signal, or the musical-tone signal.

When the user operates the operation unit 147 to start text-editing software, the user can create an HV-script while checking the content displayed on the screen of the display unit 150, and the HV-script thus created can be stored in the RAM 148.

An HV-script created by the user can also be applied to an incoming-call ringtone. In this case, the fact that the HV-script is to be used when the cellular phone receives an incoming call is stored in advance in the RAM 148 as setting information. That is, when the communication unit 143 receives call information transmitted from another cellular phone via the antenna 142, the communication unit 143 notifies the CPU 141 of the incoming call. Upon receiving this notification, the CPU 141 reads the setting information from the RAM 148, reads the HV-script indicated by the setting information from the RAM 148, and starts interpreting it. Subsequent processing is as described above; that is, the speaker 146 emits speech or musical tones according to the category of each event described in the HV-script.

The user can also attach an HV-script to an e-mail and send it to another terminal. Alternatively, the CPU 141 may interpret the text of an e-mail itself as an HV-script and, upon receiving an instruction from the user, output a reproduction instruction for the HV-script to the speech processing unit 144 according to the description in the e-mail. Not all of the functions of the HV-script player 101, the HV driver 102, the waveform reproduction player 104, and the phrase reproduction player 107 need to be borne by the CPU 141; for example, any of these functions may be borne by the sound source 145.

The present embodiment is applicable not only to cellular phones but also to portable terminals such as a PHS (personal handyphone system, a registered trademark in Japan) or a PDA (personal digital assistant), which then perform the speech and music reproduction described above.

As a practical application example of the present embodiment, an HV-script created by a user can be input into a portable mobile terminal such as a cellular phone, so that an ordinary user can easily create not only text for speech synthesis but also HV-scripts for reproducing preset sampled waveform data or music phrase data. When the sending and receiving portable terminals are each provided with the speech/music reproduction device of the present embodiment, the user can operate the portable mobile terminal to attach an HV-script to an e-mail for transmission and reception. An e-mail received by the recipient's portable mobile terminal can then appropriately reproduce not only text for speech synthesis but also preset sampled data or music phrase data. Furthermore, the reproduction of speech and music using an HV-script can be used as an incoming-call ringtone.

Next, a speech/music reproduction device according to a third embodiment of the present invention will be described with reference to FIGS. 15 and 16. The third embodiment is constructed by combining the first and second embodiments described above: the middleware performs HV reproduction, waveform reproduction, and music phrase reproduction, the sound source generates sounding signals from these three kinds of data, and the signals from the three systems are mixed and output to the speaker.

Here, FIG. 15 is a configuration based on the structure of FIG. 1 combined with part of the structure of FIG. 11; reference numerals 211 to 219 correspond to reference numerals 11 to 19 in FIG. 1, and reference numerals 303 to 313 correspond to reference numerals 103 to 113 in FIG. 11. That is, the user-data RAM of FIG. 11 is connected to the middleware of FIG. 1 through the user-data API, and the waveform reproduction player and the phrase reproduction player of FIG. 11 are added to the middleware; these players are connected to the waveform-data RAM and the music-phrase-data RAM, respectively. The application software also functions as the HV-script player of FIG. 11 and, through the middleware API, instructs any one of the HV conversion, the waveform reproduction player, and the phrase reproduction player to perform processing according to the type of event described in the HV-script. The sound source combines the three functions of the HV sound source, the waveform generator, and the phrase sound source of FIG. 11, and their output signals are combined by the adder and sounded from the speaker. The operation of each component shown in FIG. 15 is the same as that of the corresponding component shown in FIGS. 1 and 11, so detailed description thereof is omitted.

FIG. 16 is a flowchart showing the operation of the speech/music reproduction device shown in FIG. 15. It is based on the flowchart shown in FIG. 13 with part of the flowchart shown in FIG. 8 added; steps S211 to S216 correspond to steps S11 to S16 in FIG. 8, and steps S304 to S310, S312, and S313 correspond to steps S104 to S110, S112, and S113 in FIG. 13. That is, when the result of the determination in step S104 of FIG. 13 is "No", the same processing as steps S13, S14, and S15 of FIG. 8 is executed, and then HV sound source reproduction processing is executed in step S312. In this way, by inputting a single HV-script, speech reproduction by the HV sound source, reproduction of waveform data by the waveform reproduction player, and reproduction of a music phrase based on the note information by the phrase reproduction player can all be performed. The processing of each step shown in FIG. 16 is the same as in FIGS. 8 and 13, so detailed description thereof is omitted.

Finally, the prosodic symbols used in the foregoing embodiments will be described. For example, a description in an HV-script such as "は^3じま$〔…〕^ま$5し>10た。" (where 〔…〕 stands for a character that appeared as an inline image in the original text) causes the pronounced character string "はじま〔…〕ました" to be synthesized as speech with a prescribed intonation added; here, "^", "$", ">", and the like correspond to prosodic symbols. A prescribed rise or fall (intonation) is applied to the character following each prosodic symbol (or, when a numerical value immediately follows the prosodic symbol, to the character following that value).

Specifically, "^" indicates that the pitch is raised during pronunciation, "$" indicates that the pitch is lowered, and ">" indicates that the volume is reduced; speech synthesis is performed according to these symbols. When a numerical value immediately follows a prosodic symbol, the value designates the amount of change in the applied intonation. For example, in the case of the string "は^3じま", "は" is pronounced at the standard pitch and volume, the pitch is raised by "3" at "じ", and the following "ま" is pronounced at the raised pitch.
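The per-character behavior described for "は^3じま" can be sketched as a small annotator that carries running pitch and volume state across the string. This is one illustrative reading of the rules above, not the patent's parser; in particular, treating a symbol with no following number as a change of 1 is an assumption the text leaves open:

```python
import re

# Either a prosodic symbol with an optional amount, or any single character.
TOKEN = re.compile(r"([\^$>])(\d*)|(.)", re.DOTALL)

def annotate_prosody(s, base_pitch=0, base_volume=0):
    """Return (char, pitch_offset, volume_offset) for each pronounced character.

    '^n' raises pitch by n, '$n' lowers pitch by n, '>n' lowers volume by n;
    a missing n is treated here as 1 (an assumption, not from the patent).
    """
    pitch, volume = base_pitch, base_volume
    out = []
    for sym, num, ch in TOKEN.findall(s):
        if sym:
            n = int(num) if num else 1
            if sym == "^":
                pitch += n
            elif sym == "$":
                pitch -= n
            else:           # '>'
                volume -= n
        else:
            out.append((ch, pitch, volume))
    return out

print(annotate_prosody("は^3じま"))
# [('は', 0, 0), ('じ', 3, 0), ('ま', 3, 0)]
```

The output matches the worked example: "は" at the standard pitch, "じ" raised by 3, and the following "ま" held at the raised pitch.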

Thus, when a prescribed intonation (or pitch) is to be added to a character contained in the pronounced text, the prosodic symbol described above (together with a numerical value indicating the amount of pitch change) is written immediately before that character. The prosodic symbols described above control the pitch or volume of pronunciation, but they are not limited to this; for example, symbols that control voice quality or speed may also be used. By adding such symbols to an HV-script, pronunciation states such as pitch can be expressed appropriately.

The present invention is not limited to the embodiments described above; modifications within the scope of the invention are all included in the present invention.

Claims (10)

1. A voice reproducing device comprising a storage device, a registration device, an input device and a voice synthesis device; wherein,
the storage device stores synthetic dictionary data in which formant frame data corresponding to a phonetic character indicating a predetermined phonetic unit is associated with the phonetic character and stored in advance;
the registration device registers user phrase data indicating formant frame data used in place of formant frame data corresponding to a spoken word stored in the synthetic dictionary data into the user dictionary data in accordance with an operation by a user;
the input device inputs script data including a character string composed of a plurality of pronounced characters and event data indicating a replacement of formant frame data corresponding to at least a portion of the pronounced characters of the character string;
the speech synthesis device interprets the inputted scenario data, reads formant frame data from the synthesis dictionary data based on uttered characters other than at least a part of the character string, reads the user phrase data from the user dictionary data based on the event data and the part of the character string, and generates a synthesized speech based on the read formant frame data and the read user phrase data.
2. The voice reproducing device according to claim 1, further comprising music reproducing means for reproducing music based on the music reproduction information;
a data exchange format is input in the input device, the data exchange format being an information structure including music reproduction information for reproducing music and voice reproduction information containing the script data and the user phrase data, and music reproduction based on the music reproduction information and voice reproduction based on the voice reproduction information are reproduced synchronously;
the music reproduction device reproduces the music reproduction information included in the data exchange format;
the voice synthesizing device reproduces the voice reproduction information included in the data exchange format.
3. A portable terminal device comprising the speech reproducing device according to claim 1 or 2.
4. A voice reproducing device is composed of a first storage device for storing voice data, a second storage device for storing script data, a reproducing instruction device, a synthesized pronunciation signal generating device, a voice signal generating device and a synthesized voice generating device; wherein,
the script data describes a character string composed of a pronunciation character representing a predetermined pronunciation unit and event data instructing reproduction of the voice data;
the reproduction instructing means reads the scenario data from the second storage means, instructs the sound generation based on a character string in the scenario data, and instructs reproduction of the sound data based on event data in the scenario data;
a synthesized speech signal generation device that performs speech synthesis based on a speech instruction of the character string from the reproduction instruction device and generates a synthesized speech signal;
the audio signal generating means reads the audio data from the first storage means in accordance with the reproduction instruction of the audio data from the reproduction instructing means, and generates an audio signal based on the audio data;
the synthesized speech device generates a synthesized speech from the synthesized utterance signal, and generates a sound from the sound signal.
5. The speech reproducing apparatus according to claim 4, wherein the sound data is waveform data generated by sampling a predetermined sound.
6. The speech reproducing apparatus according to claim 4, wherein the sound data is music data including note information indicating a pitch and a volume of a sound to be sounded.
7. The speech reproducing apparatus according to claim 4, wherein the synthesized speech signal generating means stores formant control parameters characterizing the speech of characters, and performs speech synthesis using the formant control parameters corresponding to character strings in the script data.
8. The speech reproducing apparatus according to any one of claims 4 to 7, wherein the script data is described in a file made of text data.
9. The speech reproducing apparatus according to claim 4, wherein synthesis dictionary data is provided which associates formant frame data corresponding to a phonetic character indicating a predetermined phonetic unit with the phonetic character and which is stored in advance;
registering user phrase data indicating other formant frame data used in place of formant frame data corresponding to a spoken word stored in the synthetic dictionary data in user dictionary data in accordance with an operation by a user;
when the script data includes event data indicating replacement of formant frame data corresponding to at least a part of phonetic characters of the character string, the synthesized phonetic signal generating device reads formant frame data from the synthesized dictionary data based on phonetic characters other than the part of phonetic characters, reads the user phrase data from the user dictionary data based on the event data and the part of character string, and generates the synthesized phonetic signal based on the read formant frame data and the read user phrase data.
10. A portable terminal device comprising the speech reproducing device according to any one of claims 4 to 9.
CNB2004100474146A 2003-05-29 2004-05-28 Speech and music regeneration device Expired - Fee Related CN1310209C (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2003152895 2003-05-29
JP2003152895A JP4244706B2 (en) 2003-05-29 2003-05-29 Audio playback device
JP2003340171A JP2005107136A (en) 2003-09-30 2003-09-30 Voice and musical piece reproducing device
JP2003340171 2003-09-30

Publications (2)

Publication Number Publication Date
CN1573921A CN1573921A (en) 2005-02-02
CN1310209C true CN1310209C (en) 2007-04-11

Family

ID=34525345

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100474146A Expired - Fee Related CN1310209C (en) 2003-05-29 2004-05-28 Speech and music regeneration device

Country Status (3)

Country Link
KR (1) KR100612780B1 (en)
CN (1) CN1310209C (en)
TW (1) TWI265718B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101102191B1 (en) * 2005-09-02 2012-01-02 주식회사 팬택 Apparatus and method for changing and reproducing a sound source of a mobile communication terminal
CN101694772B (en) * 2009-10-21 2014-07-30 北京中星微电子有限公司 Method for converting text into rap music and device thereof

Citations (5)

Publication number Priority date Publication date Assignee Title
JP2001051688A (en) * 1999-08-10 2001-02-23 Hitachi Ltd E-mail reading device using speech synthesis
JP2002221980A (en) * 2001-01-25 2002-08-09 Oki Electric Ind Co Ltd Text voice converter
JP2002366186A (en) * 2001-06-11 2002-12-20 Hitachi Ltd Speech synthesis method and speech synthesis device for implementing the method
JP2003029774A (en) * 2001-07-19 2003-01-31 Matsushita Electric Ind Co Ltd Speech waveform dictionary distribution system, speech waveform dictionary creation device, and speech synthesis terminal device
CN1416053A (en) * 2001-11-02 2003-05-07 日本电气株式会社 Speech synthetic system and speech synthetic method

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
KR100279741B1 (en) * 1998-08-17 2001-02-01 정선종 Operation Control Method of Text / Speech Converter Using Hypertext Markup Language Element
JP2002073507A (en) * 2000-06-15 2002-03-12 Sharp Corp Electronic mail system and electronic mail device
KR100351590B1 (en) * 2000-12-19 2002-09-05 (주)신종 A method for voice conversion


Also Published As

Publication number Publication date
HK1069433A1 (en) 2005-05-20
KR100612780B1 (en) 2006-08-17
CN1573921A (en) 2005-02-02
TWI265718B (en) 2006-11-01
TW200427297A (en) 2004-12-01
KR20040103433A (en) 2004-12-08

Similar Documents

Publication Publication Date Title
JP3938015B2 (en) Audio playback device
CN1269104C (en) Text structure for voice synthesis, voice synthesis method, voice synthesis apparatus, and computer program thereof
CN1146863C (en) Speech synthesis method and apparatus thereof
CN1194336C (en) Waveform generating method and appts. thereof
CN1254786C (en) Method for synthetic output with prompting sound and text sound in speech synthetic system
CN1310209C (en) Speech and music regeneration device
CN100342426C (en) Singing generator and portable communication terminal having singing generation function
KR100634142B1 (en) Potable terminal device
JP2002196779A (en) Method and apparatus for changing musical sound of sound signal
HK1069433B (en) Speech and music reproduction apparatus
CN1273953C (en) Audio synthesis system capable of synthesizing different types of audio data
JP4244706B2 (en) Audio playback device
CN2694427Y (en) Sound synthesis system capable of synthesizing different kinds of sound data
HK1077390B (en) Apparatus for generating singing voice and portable communication terminal having function of generating singing voice
CN1547192A (en) An audio synthesis method
KR100650071B1 (en) Musical tone and human speech reproduction apparatus and method
JP4366918B2 (en) Mobile device
CN1629933A (en) Sound unit for bilingualism connection and speech synthesis
HK1063373B (en) Musical tone and voice reproduction device and control method thereof, and server device
HK1062952A (en) Apparatus and method for reproducing voice in synchronism with music piece
JP2005107136A (en) Voice and musical piece reproducing device
HK1066365B (en) Portable terminal device
JP2005234208A (en) Musical sound reproducing device and mobile terminal device
JP2005229511A (en) Musical sound generation apparatus
HK1056037A1 (en) Musical sound generator, portable terminal, and musical sound generating method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1069433

Country of ref document: HK

C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070411

Termination date: 20130528